Abstract: We consider nonparametric Bayesian estimation of a probability
density $p$ based on a random sample of size $n$ from this density using
a hierarchical prior. The prior consists, for instance, of prior weights on
the regularity of the unknown density combined with priors that are
appropriate given that the density has this regularity. More generally, the
hierarchy consists of prior weights on an abstract model index and a prior
on a density model for each model index. We present a general theorem on
the rate of contraction of the resulting posterior distribution as $n \to \infty$,
which gives conditions under which the rate of contraction is the one
attached to the model that best approximates the true density of the
observations. This shows that, for instance, the posterior distribution can adapt
to the smoothness of the underlying density. We also study the posterior
distribution of the model index, and find that under the same conditions
the posterior distribution gives negligible weight to models that are bigger
than the optimal one, and thus selects the optimal model or smaller models
that also approximate the true density well. We apply these results to log
spline density models, where we show that the prior weights on the
regularity index interact with the priors on the models, making the exact rates
depend in a complicated way on the priors, but also that the rate is fairly
robust to specification of the prior weights.

1. Introduction

It is well known that the selection of a suitable "bandwidth" is crucial in
nonparametric estimation of densities. Within a Bayesian framework it is natural
to put a prior on bandwidth and let the data decide on a correct bandwidth
through the corresponding posterior distribution. More generally a Bayesian
procedure might consist of the specification of a suitable prior on a statistical
model that is "correct" if the true density possesses a certain regularity level,
together with the specification of a prior on the regularity. Such a hierarchical
Bayesian procedure fits naturally within the framework of adaptive estimation,
which focuses on constructing estimators that automatically choose a best model
from a given set of models. Given a collection of models an estimator is said
to be rate-adaptive if it attains the rate of convergence that would have been
attained had only the best model been used. For instance, the minimax rate of
convergence for estimating a density on $[0,1]^d$ that is known to have $\alpha$
derivatives is $n^{-\alpha/(2\alpha+d)}$. An estimator would be rate-adaptive to the set of models
consisting of all smooth densities if it attained the rate $n^{-\alpha/(2\alpha+d)}$ whenever
the true density is $\alpha$-smooth, for any $\alpha > 0$. (See e.g. Tsybakov (2004).)



In this paper we present a general result on adaptation for density estimation
within the Bayesian framework. The observations are a random sample
$X_1, \ldots, X_n$ from a density on a given measurable space. Given a countable
collection of density models $\mathcal{P}_{n,\alpha}$, indexed by a parameter $\alpha \in A_n$, each provided
with a prior distribution $\Pi_{n,\alpha}$, and a prior distribution $\lambda_n$ on $A_n$, we consider
the posterior distribution relative to the prior that first chooses $\alpha$ according
to $\lambda_n$ and next $p$ according to $\Pi_{n,\alpha}$ for the chosen $\alpha$. The index $\alpha$ may be a
regularity parameter, but in the general result it may be arbitrary. Thus the
overall prior is a probability measure on the set of probability densities, given
by

\[ \Pi_n = \sum_{\alpha \in A_n} \lambda_{n,\alpha} \Pi_{n,\alpha}. \tag{1.1} \]

Given this prior distribution, the corresponding posterior distribution is the
random measure

\[ \Pi_n(B \mid X_1, \ldots, X_n) = \frac{\sum_{\alpha \in A_n} \lambda_{n,\alpha} \int_{p \in \mathcal{P}_{n,\alpha}:\, p \in B} \prod_{i=1}^n p(X_i)\, d\Pi_{n,\alpha}(p)}{\sum_{\alpha \in A_n} \lambda_{n,\alpha} \int_{p \in \mathcal{P}_{n,\alpha}} \prod_{i=1}^n p(X_i)\, d\Pi_{n,\alpha}(p)}. \tag{1.2} \]

Of course, we impose appropriate (measurability) conditions to ensure that this
expression is well defined.
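
To make the structure of the mixture prior (1.1) and its posterior concrete, here is a
minimal sketch in a toy setting that is not taken from the paper: the observations are
Bernoulli, and each model is a finite set of success probabilities with a discrete prior.
The function names and the two toy models are assumptions made purely for illustration.

```python
def marginal_likelihood(data, model):
    """Integral of prod_i p(X_i) against a discrete prior.

    `model` is a list of (prior_mass, success_probability) pairs, a toy
    stand-in for a prior Pi_{n,alpha} on a model of Bernoulli densities.
    """
    ones = sum(data)
    zeros = len(data) - ones
    return sum(mass * theta**ones * (1.0 - theta)**zeros for mass, theta in model)


def posterior_model_weights(data, weights, models):
    """Posterior probabilities of the model index under the mixture prior."""
    joint = [lam * marginal_likelihood(data, model)
             for lam, model in zip(weights, models)]
    total = sum(joint)
    return [value / total for value in joint]


# Toy example: a singleton model versus a two-point model, equal prior weights.
models = [[(1.0, 0.5)], [(0.5, 0.25), (0.5, 0.75)]]
weights = [0.5, 0.5]
posterior = posterior_model_weights([1, 1], weights, models)
```

For the two observations `[1, 1]` the marginal likelihoods are 0.25 and 0.3125, so the
posterior weights of the two models are 4/9 and 5/9.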

We say that the posterior distributions have rate of convergence at least $\epsilon_n$
if, for every sufficiently large constant $M$, as $n \to \infty$, in probability,

\[ \Pi_n\bigl(p: d(p, p_0) \ge M\epsilon_n \mid X_1, \ldots, X_n\bigr) \to 0. \]

Here the distribution of the random measure (1.2) is evaluated under the
assumption that $X_1, \ldots, X_n$ are an i.i.d. sample from $p_0$, and $d$ is a distance
on the set of densities. Throughout the paper this distance is assumed to be
bounded above by the Hellinger distance and to generate convex balls. (For
instance, the Hellinger or $L_1$-distance, or the $L_2$-distance if the densities are
uniformly bounded.) Thus we study the asymptotics of the posterior distribution
in the frequentist sense.

The aim is to prove a result of the following type. For a given $p_0$ there exists a
best model $\mathcal{P}_{n,\beta_n}$ that gives a posterior rate $\epsilon_{n,\beta_n}$ if it would be combined with
the prior $\Pi_{n,\beta_n}$. The hierarchical Bayesian procedure would adapt to the set of
models if the posterior distributions (1.2), which are based on the mixture prior
(1.1), have the same rate of convergence for this $p_0$, for any $p_0$ in some model
$\mathcal{P}_{n,\alpha}$. Technically, the first main result is Theorem 2.1 in Section 2.

This sense of Bayesian adaptation refers to the "full" posterior, both its
centering and its spread. As noted by Belitser and Ghosal (2003), suitably defined
centers of these posteriors would yield adaptive point estimators.

The posterior distribution can be viewed as a mixture of the posterior
distributions on the various models, with the weights given by the posterior
distribution of the model index. Our second main result, Theorem 3.1, concerns
the posterior distribution of the model index. It shows that models that are 
"bigger" than the optimal model asymptotically achieve zero posterior mass. 
On the other hand, under our conditions the posterior may distribute its mass 
over a selection of smaller models, provided that these can approximate the true 
distribution well. 

In the situation that there are precisely two models this phenomenon can
be conveniently described by the Bayes factor of the two models. We provide
simple sufficient conditions for the Bayes factor to select the "true" model with
probability tending to one. This consistency property is especially relevant for
Bayesian goodness of fit testing against a nonparametric alternative. A
computationally advantageous method for such goodness of fit tests was developed
by Berger and Guglielmi (2001), using a mixture of Polya tree priors on the
nonparametric alternative. Asymptotic properties of Bayes factors for nested
regular parametric models have been well studied, beginning with the pioneering
work by Schwarz (1978), who also introduced the Bayesian information criterion.
However, large sample properties of Bayes factors when at least one model is
infinite dimensional appear to be unknown except in special cases. The paper
Dass and Lee (2004) showed consistency of Bayes factors when one of the models
is a singleton and the prior for the other model assigns positive probabilities
to the Kullback-Leibler neighborhoods of the true density, popularly known as
the Kullback-Leibler property. The paper Walker et al. (2004) showed (in
particular) that if the prior on one model has the Kullback-Leibler property and
the other does not, then the Bayes factor will asymptotically favour the model
with the Kullback-Leibler property. Unfortunately, the proof of Dass and Lee
(2004) does not generalize to general null models, and frequently both priors will
have the Kullback-Leibler property, precluding the application of Walker et al.
(2004). In Sections 3 and 4 we study these issues in general.



The present paper is an extension of the paper Ghosal et al. (2003), which
studies adaptation to finitely many models of splines with a uniform weight on
the models. In the present paper we derive a result for general models, possibly
infinitely many, and investigate different model weights. Somewhat surprisingly 
we find that both weights that give more prior mass to small models and weights 
that downweight small models may lead to adaptation. 

Related work on Bayesian adaptation was carried out by Huang (2004), who
considers adaptation using scales of finite-dimensional models, and Lember and
van der Vaart (2007), who consider special weights that downweight large
models. Our methods of proof borrow from Ghosal et al. (2000).

The paper is organized as follows. After stating the main theorems on adaptation
and model selection and some corollaries in Sections 2 and 3, we investigate
adaptation in detail in the context of log spline models in Section 5, and we
consider the Bayes factors for testing a finite- versus an infinite-dimensional model
in detail in Section 4. The proof of the main theorems is given in Section 6 and
further technical proofs and complements are given in Section 7.



1.1. Notation 

Throughout the paper the data are a random sample $X_1, \ldots, X_n$ from a
probability measure $P_0$ on a measurable space $(\mathcal{X}, \mathcal{A})$ with density $p_0$ relative to a
given reference measure $\mu$ on $(\mathcal{X}, \mathcal{A})$. In general we write $p$ and $P$ for a
density and the corresponding probability measure. The Hellinger distance between
two densities $p$ and $q$ relative to $\mu$ is defined as $h(p, q) = \|\sqrt{p} - \sqrt{q}\|_2$, for $\|\cdot\|_2$
the norm of $L_2(\mu)$. The $\epsilon$-covering numbers and $\epsilon$-packing numbers of a metric
space $(\mathcal{P}, d)$, denoted by $N(\epsilon, \mathcal{P}, d)$ and $D(\epsilon, \mathcal{P}, d)$, are defined as the minimal
number of balls of radius $\epsilon$ needed to cover $\mathcal{P}$, and the maximal number of
$\epsilon$-separated points, respectively.

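For each $n \in \mathbb{N}$ the index set is a countable set $A_n$, and for every $\alpha \in A_n$
the set $\mathcal{P}_{n,\alpha}$ is a set of $\mu$-probability densities on $(\mathcal{X}, \mathcal{A})$ equipped with a $\sigma$-field
such that the maps $(x, p) \mapsto p(x)$ are measurable. Furthermore, $\Pi_{n,\alpha}$ denotes a
probability measure on $\mathcal{P}_{n,\alpha}$, and $\lambda_n = (\lambda_{n,\alpha}: \alpha \in A_n)$ is a probability measure
on $A_n$. We define

\[ B_{n,\alpha}(\epsilon) = \Bigl\{p \in \mathcal{P}_{n,\alpha}: -P_0 \log\frac{p}{p_0} \le \epsilon^2,\ P_0\Bigl(\log\frac{p}{p_0}\Bigr)^2 \le \epsilon^2\Bigr\}, \]
\[ C_{n,\alpha}(\epsilon) = \bigl\{p \in \mathcal{P}_{n,\alpha}: d(p, p_0) \le \epsilon\bigr\}. \tag{1.3} \]

Throughout the paper $\epsilon_{n,\alpha}$ are given positive numbers with $\epsilon_{n,\alpha} \to 0$ as $n \to \infty$.
These may be thought of as the rate attached to the model $\mathcal{P}_{n,\alpha}$ if this is
(approximately) correct.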

The notation $a \lesssim b$ means that $a \le Cb$ for a constant $C$ that is universal
or fixed in the proof. For sequences $a_n$ and $b_n$ we write $a_n \ll b_n$ if $a_n/b_n \to 0$,
and $a_n \gg 0$ if $a_n > 0$ for every $n$ and $\liminf a_n > 0$. For a measure $P$ and a
measurable function $f$ we write $Pf$ for the integral of $f$ relative to $P$.

2. Adaptation

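For $\beta_n$ a given element of $A_n$, thought to be the index of a best model for a
given fixed true density $p_0$, we split the index set in the indices that give a faster
or slower rate: for a fixed constant $H \ge 1$,

\[ A_{n,\ge\beta_n} := \bigl\{\alpha \in A_n: \epsilon_{n,\alpha}^2 \le H\epsilon_{n,\beta_n}^2\bigr\}, \qquad A_{n,<\beta_n} := \bigl\{\alpha \in A_n: \epsilon_{n,\alpha}^2 > H\epsilon_{n,\beta_n}^2\bigr\}. \]

Even though we do not assume that $A_n$ is ordered, we shall write $\alpha \ge \beta_n$ and
$\alpha < \beta_n$ if $\alpha$ belongs to the sets $A_{n,\ge\beta_n}$ or $A_{n,<\beta_n}$, respectively. The set $A_{n,\ge\beta_n}$
contains $\beta_n$ and hence is never empty, but the set $A_{n,<\beta_n}$ can be empty (if $\beta_n$
is the "smallest" possible index). In the latter case conditions involving $\alpha < \beta_n$
are understood to be automatically satisfied.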
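
The split of the index set can be sketched in a few lines. The code below assumes the
reading that an index $\alpha$ counts as "at least $\beta_n$" when $\epsilon_{n,\alpha}^2 \le H\epsilon_{n,\beta_n}^2$; the function
name, the dictionary representation of the rates and the example rates
$n^{-\alpha/(2\alpha+1)}$ are illustrative assumptions, not part of the paper.

```python
def split_index_set(rates, beta, H=1.0):
    """Split the indices into those with a rate at least as fast as at beta, and the rest.

    `rates` maps each index alpha to eps_{n,alpha}. Assumed reading of the split:
    alpha belongs to the first set if eps_{n,alpha}^2 <= H * eps_{n,beta}^2.
    """
    threshold = H * rates[beta] ** 2
    at_least = {a for a, eps in rates.items() if eps ** 2 <= threshold}
    below = set(rates) - at_least
    return at_least, below


# Hypothetical example: rates n^(-alpha/(2*alpha+1)) for three smoothness levels.
n = 10_000
rates = {alpha: n ** (-alpha / (2 * alpha + 1)) for alpha in (0.5, 1.0, 2.0)}
at_least, below = split_index_set(rates, beta=1.0)
```

With $\beta_n = 1$ and $H = 1$ the smoother index 2 joins $\beta_n$ in the first set, while the
rougher index 0.5, which has a slower rate, lands in the second set.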

The assumptions of the following theorem are reminiscent of the assumptions
in Ghosal et al. (2000), and entail a bound on the complexity of the models and
a condition on the concentration of the priors. The complexity bound is exactly
as in Ghosal et al. (2000) and takes the form: for some constants $E_\alpha$,

\[ \sup_{\epsilon \ge \epsilon_{n,\alpha}} \log N\Bigl(\frac{\epsilon}{3}, C_{n,\alpha}(2\epsilon), d\Bigr) \le E_\alpha n\epsilon_{n,\alpha}^2, \qquad \alpha \in A_n. \tag{2.1} \]

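The conditions on the priors involve comparisons of the prior masses of balls
of various sizes in various models. These conditions are split in conditions on
the models that are smaller or bigger than the best model: for given constants
$\mu_{n,\alpha}$, $L$ and $I$,

\[ \frac{\lambda_{n,\alpha}\Pi_{n,\alpha}\bigl(C_{n,\alpha}(i\epsilon_{n,\alpha})\bigr)}{\lambda_{n,\beta_n}\Pi_{n,\beta_n}\bigl(B_{n,\beta_n}(\epsilon_{n,\beta_n})\bigr)} \le \mu_{n,\alpha} e^{Li^2 n\epsilon_{n,\alpha}^2}, \qquad \alpha < \beta_n,\ i \ge I, \tag{2.2} \]
\[ \frac{\lambda_{n,\alpha}\Pi_{n,\alpha}\bigl(C_{n,\alpha}(i\epsilon_{n,\beta_n})\bigr)}{\lambda_{n,\beta_n}\Pi_{n,\beta_n}\bigl(B_{n,\beta_n}(\epsilon_{n,\beta_n})\bigr)} \le \mu_{n,\alpha} e^{Li^2 n\epsilon_{n,\beta_n}^2}, \qquad \alpha \ge \beta_n,\ i \ge I. \tag{2.3} \]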



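A final condition requires that the prior mass in a ball of radius $\epsilon_{n,\alpha}$ in a big
model (i.e. small $\alpha$) is significantly smaller than in a small model: for some
constants $I$, $B$,

\[ \sum_{\alpha \in A_n:\, \alpha < \beta_n} \frac{\lambda_{n,\alpha}\Pi_{n,\alpha}\bigl(C_{n,\alpha}(IB\epsilon_{n,\alpha})\bigr)}{\lambda_{n,\beta_n}\Pi_{n,\beta_n}\bigl(B_{n,\beta_n}(\epsilon_{n,\beta_n})\bigr)} = o\bigl(e^{-2n\epsilon_{n,\beta_n}^2}\bigr). \tag{2.4} \]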

Let K be the universal testing constant. According to the assertion of LemmajHUl 
it can certainly be taken equal to $K = 1/9$.

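Theorem 2.1. Assume there exist positive constants $B$, $E_\alpha$, $L$, $H \ge 1$, $I > 2$
such that (2.1), (2.2), (2.3) and (2.4) hold, and constants $E$ and $\underline{E}$ such that
$E \ge \sup_{\alpha \in A_n:\, \alpha \ge \beta_n} E_\alpha \epsilon_{n,\alpha}^2/\epsilon_{n,\beta_n}^2$ and $\underline{E} \ge \sup_{\alpha \in A_n:\, \alpha < \beta_n} E_\alpha$ (with $\underline{E} = 0$ if
$A_{n,<\beta_n} = \emptyset$),

\[ B > \sqrt{H}, \qquad KB^2 > (HE) \vee \underline{E} + 1, \qquad B^2 I^2 (K - 2L) > 3. \]

Furthermore, assume that $\sum_{\alpha \in A_n} \sqrt{\mu_{n,\alpha}} \le \exp[n\epsilon_{n,\beta_n}^2]$. If $\beta_n \in A_n$ for every $n$
and satisfies $n\epsilon_{n,\beta_n}^2 \to \infty$, then the posterior distribution (1.2) satisfies

\[ P_0^n \Pi_n\bigl(p: d(p, p_0) \ge IB\epsilon_{n,\beta_n} \mid X_1, \ldots, X_n\bigr) \to 0. \]

The proof of the theorem is deferred to Section 6.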

In many situations (although not in the main examples of the present paper)
relatively crude bounds on the prior masses in (2.2), (2.3) and (2.4) are
sufficient. In particular, the following lower bound is often useful: for a positive
constant $F$,

\[ \Pi_{n,\beta_n}\bigl(B_{n,\beta_n}(\epsilon_{n,\beta_n})\bigr) \ge \exp[-Fn\epsilon_{n,\beta_n}^2]. \tag{2.5} \]

This corresponds to the "crude" prior mass condition of Ghosal et al. (2000).
Combined with the trivial bound 1 on the probabilities $\Pi_{n,\alpha}(C)$ in (2.2) and
(2.3), we see that these conditions hold (for sufficiently large $I$) if, for all $\alpha \in A_n$,

\[ \frac{\lambda_{n,\alpha}}{\lambda_{n,\beta_n}} \le \mu_{n,\alpha}\, e^{n(H^{-1}\epsilon_{n,\alpha}^2 \vee \epsilon_{n,\beta_n}^2)}. \tag{2.6} \]



This appears to be a mild requirement. On the other hand, the similarly adapted
version of condition (2.4) still requires that

\[ \sum_{\alpha \in A_n:\, \alpha < \beta_n} \frac{\lambda_{n,\alpha}}{\lambda_{n,\beta_n}} \Pi_{n,\alpha}\bigl(C_{n,\alpha}(IB\epsilon_{n,\alpha})\bigr) = o\bigl(e^{-(F+2)n\epsilon_{n,\beta_n}^2}\bigr). \tag{2.7} \]

Such a condition may be satisfied because the prior probabilities
$\Pi_{n,\alpha}(C_{n,\alpha}(IB\epsilon_{n,\alpha}))$ are very small. For instance, a reverse bound of the type
(2.5) for $\alpha$ instead of $\beta_n$ would yield this type of bound for fairly general model
weights $\lambda_{n,\alpha}$, since $\epsilon_{n,\alpha}^2 > H\epsilon_{n,\beta_n}^2$ for $\alpha < \beta_n$. Alternatively, the condition
could be forced by choice of the model weights $\lambda_{n,\alpha}$, for general priors $\Pi_{n,\alpha}$. For
instance, in Section 5.3 we consider weights of the type

\[ \lambda_{n,\alpha} = \frac{\mu_\alpha \exp[-Cn\epsilon_{n,\alpha}^2]}{\sum_{\alpha \in A_n} \mu_\alpha \exp[-Cn\epsilon_{n,\alpha}^2]}. \]

Such weights were also considered in Lember and van der Vaart (2007), who
discuss several other concrete examples. For reference we codify the preceding
discussion as a lemma.
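
As a small numerical illustration of weights of this exponential type, the sketch below
normalises $\mu_\alpha \exp[-Cn\epsilon_{n,\alpha}^2]$ over a finite index set. The function name, the choice
$\mu_\alpha = 1$, $C = 1$ and the rates $n^{-\alpha/(2\alpha+1)}$ are assumptions made for the example only.

```python
import math


def model_weights(mu, rates, n, C=1.0):
    """Weights proportional to mu_alpha * exp(-C * n * eps_{n,alpha}^2), summing to one."""
    log_w = {a: math.log(mu[a]) - C * n * rates[a] ** 2 for a in mu}
    shift = max(log_w.values())  # subtract the maximum to avoid numerical underflow
    unnormalised = {a: math.exp(v - shift) for a, v in log_w.items()}
    total = sum(unnormalised.values())
    return {a: v / total for a, v in unnormalised.items()}


# Hypothetical example with three smoothness levels and equal base weights mu_alpha.
n = 1000
alphas = (0.5, 1.0, 2.0)
rates = {a: n ** (-a / (2 * a + 1)) for a in alphas}
mu = {a: 1.0 for a in alphas}
weights = model_weights(mu, rates, n)
```

Because $n\epsilon_{n,\alpha}^2$ is largest for the roughest (biggest) model, such weights put
exponentially little prior mass on big models.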

Lemma 2.1. Conditions (2.5), (2.6) and (2.7) are sufficient for (2.2), (2.3)
and (2.4).



Theorem 2.1 excludes the case that $\epsilon_{n,\beta_n}$ is equal to the "parametric rate"
$1/\sqrt{n}$. To cover this case the statement of the theorem must be slightly adapted.
The proof of the following theorem is given in Section 6.

Theorem 2.2. Assume there exist positive constants $B$, $E_\alpha$, $L < K/2$, $H \ge 1$, $I$
such that (2.1), (2.2), (2.3) and (2.4) hold for every sufficiently large $I$.
Furthermore, assume that $\sum_{\alpha \in A_n} \sqrt{\mu_{n,\alpha}} = O(1)$. If $\beta_n \in A_n$ for every $n$ and
$\epsilon_{n,\beta_n} = n^{-1/2}$, then the posterior distribution (1.2) satisfies, for every $I_n \to \infty$,

\[ P_0^n \Pi_n\bigl(p: d(p, p_0) \ge I_n\epsilon_{n,\beta_n} \mid X_1, \ldots, X_n\bigr) \to 0. \]

For further understanding it is instructive to apply the theorems to the
situation of two models, say $\mathcal{P}_{n,1}$ and $\mathcal{P}_{n,2}$ with rates $\epsilon_{n,1} > \epsilon_{n,2}$. For simplicity
we shall also assume (2.5) and use universal constants.

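Corollary 2.1. Assume that (2.1) holds for $\alpha \in A_n = \{1, 2\}$ and sequences
$\epsilon_{n,1} > \epsilon_{n,2}$.

(1) If $\Pi_{n,1}(B_{n,1}(\epsilon_{n,1})) \ge e^{-n\epsilon_{n,1}^2}$ and $\lambda_{n,2}/\lambda_{n,1} \le e^{n\epsilon_{n,1}^2}$, then the posterior
rate of contraction is at least $\epsilon_{n,1}$.

(2) If $\Pi_{n,2}(B_{n,2}(\epsilon_{n,2})) \ge e^{-n\epsilon_{n,2}^2}$ and $\lambda_{n,2}/\lambda_{n,1} \ge e^{-n\epsilon_{n,1}^2}$ and, moreover,
$\Pi_{n,1}(C_{n,1}(I\epsilon_{n,1})) \le (\lambda_{n,2}/\lambda_{n,1})\, o(e^{-3n\epsilon_{n,2}^2})$ for every $I$, then the posterior
rate of contraction is at least $\epsilon_{n,2}$.

Proof. We apply the preceding theorems with $\beta_n = 1$, $A_{n,<\beta_n} = \emptyset$ and $A_{n,\ge\beta_n} =
\{1, 2\}$ in case (1) and $\beta_n = 2$, $A_{n,<\beta_n} = \{1\}$ and $A_{n,\ge\beta_n} = \{2\}$ in case (2), both
times with $H = 1$ and $\mu_{n,1} = \mu_{n,2} = 1$. ■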

Statement (1) of the corollary gives the slower $\epsilon_{n,1}$ of the two rates under
the assumption that the bigger model satisfies the prior mass condition (2.5)
and a condition on the weights $\lambda_{n,i}$ that ensures that the smaller model is not
overly downweighted. The latter condition is very mild, as it allows the weights
of the two models to be very different. Apart from this, statement (1) is not
surprising, and could also be obtained from nonadaptive results on posterior
rates of contraction, as in Ghosal et al. (2000).

Statement (2) gives the faster rate of contraction $\epsilon_{n,2}$ under the condition
that the smaller model satisfies the prior mass condition (2.5), an equally mild
condition on the relative weights of the two models, and an additional
condition on the prior weight $\Pi_{n,1}(C_{n,1}(I\epsilon_{n,1}))$ that the bigger model attaches to
neighbourhoods of the true distribution. If this would be of the expected order
$\exp(-Fn\epsilon_{n,1}^2)$, then the conditions on the weights $\lambda_{n,1}$ and $\lambda_{n,2}$ in the union of
(1) and (2) can be summarized as

\[ e^{-n\epsilon_{n,1}^2} \lesssim \frac{\lambda_{n,2}}{\lambda_{n,1}} \lesssim e^{n\epsilon_{n,1}^2}. \]

This is a remarkably big range of weights. One might conclude that Bayesian 
methods are very robust to the prior specification of model weights. One might 
also more cautiously guess that rate-asymptotics do not yield a complete picture 
of the performance of the various priors (even though rates are considerably 
more informative than consistency results). 

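Remark 2.1. The entropy condition (2.1) can be relaxed to the same condition
on a submodel $\mathcal{P}'_{n,\alpha} \subset \mathcal{P}_{n,\alpha}$ that carries most of the prior mass, in the sense
that

\[ \frac{\sum_{\alpha \in A_n} \lambda_{n,\alpha}\Pi_{n,\alpha}\bigl(\mathcal{P}_{n,\alpha} \setminus \mathcal{P}'_{n,\alpha}\bigr)}{\lambda_{n,\beta_n}\Pi_{n,\beta_n}\bigl(B_{n,\beta_n}(\epsilon_{n,\beta_n})\bigr)} = o\bigl(e^{-2n\epsilon_{n,\beta_n}^2}\bigr). \]

This follows, because in that case the posterior will concentrate on $\cup_\alpha \mathcal{P}'_{n,\alpha}$ (see
Ghosal and van der Vaart (2007), Lemma 1). This relaxation has been found
useful in several papers on nonadaptive rates. In the present context, it seems
that the condition would only be natural if it is valid for the index $\beta_n$ that gives
the slowest of all rates $\epsilon_{n,\alpha}$.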

3. Model Selection 

The theorems in the preceding section concern the concentration of the posterior 
distribution on the set of densities relative to the metric d. In this section we 
consider the posterior distribution of the index parameter a, within the same 
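set-up. The proof of the following theorem is given in Section 6.

Somewhat abusing notation (cf. (1.2)), we write, for any set $B \subset A_n$ of
indices,

\[ \Pi_n(B \mid X_1, \ldots, X_n) = \frac{\sum_{\alpha \in B} \lambda_{n,\alpha} \int \prod_{i=1}^n p(X_i)\, d\Pi_{n,\alpha}(p)}{\sum_{\alpha \in A_n} \lambda_{n,\alpha} \int \prod_{i=1}^n p(X_i)\, d\Pi_{n,\alpha}(p)}. \]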



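Theorem 3.1. Under the conditions of Theorem 2.1,

\[ P_0^n \Pi_n\bigl(A_{n,<\beta_n} \mid X_1, \ldots, X_n\bigr) \to 0, \]
\[ P_0^n \Pi_n\bigl(\alpha \in A_{n,\ge\beta_n}: d(p_0, \mathcal{P}_{n,\alpha}) > IB\epsilon_{n,\beta_n} \mid X_1, \ldots, X_n\bigr) \to 0. \]

Under the conditions of Theorem 2.2 this is true with $IB$ replaced by $I_n$, for
any $I_n \to \infty$.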

The first assertion of the theorem is pleasing. It can be interpreted in the sense 
that the models that are bigger than the model $\mathcal{P}_{n,\beta_n}$ that contains the true
distribution eventually receive negligible posterior weight. The second assertion 
makes a similar claim about the smaller models, but it is restricted to the 
smaller models that keep a certain distance to the true distribution. Such a 
restriction appears not unnatural, as a small model that can represent the true 
distribution well ought to be favoured by the posterior: the posterior looks at the 
data through the likelihood and hence will judge a model by its approximation 
properties rather than its parametrization. That big models with similarly good 
approximation properties are not favoured is caused by the fact that (under 
our conditions) the prior mass on the big models is more spread out, yielding 
relatively little prior mass near good approximants within the big models. 

It is again insightful to specialize the theorem to the case of two models, and
simplify the prior mass conditions to (2.5). The behaviour of the posterior of
the model index can then be described through the Bayes factor

\[ \mathrm{BF}_n = \frac{\lambda_{n,2} \int \prod_{i=1}^n p(X_i)\, d\Pi_{n,2}(p)}{\lambda_{n,1} \int \prod_{i=1}^n p(X_i)\, d\Pi_{n,1}(p)}. \]

Corollary 3.1. Assume that (2.1) holds for $\alpha \in A_n = \{1, 2\}$ and sequences
$\epsilon_{n,1} > \epsilon_{n,2}$.

(1) If $\Pi_{n,1}(B_{n,1}(\epsilon_{n,1})) \ge e^{-n\epsilon_{n,1}^2}$ and $\lambda_{n,2}/\lambda_{n,1} \le e^{n\epsilon_{n,1}^2}$ and $d(p_0, \mathcal{P}_{n,2}) \ge
I_n\epsilon_{n,1}$ for every $n$ and some $I_n \to \infty$, then $\mathrm{BF}_n \to 0$ in $P_0^n$-probability.

(2) If $\Pi_{n,2}(B_{n,2}(\epsilon_{n,2})) \ge e^{-n\epsilon_{n,2}^2}$ and $\lambda_{n,2}/\lambda_{n,1} \ge e^{-n\epsilon_{n,1}^2}$ and
$\Pi_{n,1}(C_{n,1}(I\epsilon_{n,1})) \le (\lambda_{n,2}/\lambda_{n,1})\, o(e^{-3n\epsilon_{n,2}^2})$ for every $I$, then $\mathrm{BF}_n \to \infty$ in
$P_0^n$-probability.

Proof. The Bayes factor tends to 0 or $\infty$ if the posterior probability of model
$\mathcal{P}_{n,2}$ or $\mathcal{P}_{n,1}$ tends to zero, respectively. Therefore, we can apply Theorem 3.1
with the same choices as in the proof of Corollary 2.1. ■

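In particular, if the two models are equally weighted ($\lambda_{n,1} = \lambda_{n,2}$), the models
satisfy (2.1) and the priors satisfy (2.5), then the Bayes factors are
asymptotically consistent if

\[ d(p_0, \mathcal{P}_{n,2}) \gg \epsilon_{n,1} \quad \text{if } p_0 \notin \mathcal{P}_{n,2}, \qquad \Pi_{n,1}\bigl(C_{n,1}(I\epsilon_{n,1})\bigr) = o\bigl(e^{-3n\epsilon_{n,2}^2}\bigr) \quad \text{if } p_0 \in \mathcal{P}_{n,2}. \]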
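
The two directions of Bayes factor consistency can be seen in a toy calculation. The
sketch below is not one of the paper's examples: it takes model 2 to be the singleton
$\{N(0,1)\}$ and model 1 to be $\{N(\theta,1)\}$ with a $N(0,\tau^2)$ prior on $\theta$, for which the
marginal likelihoods are available in closed form. The function name and parameters are
assumptions made for illustration, and model 1 is parametric here rather than infinite
dimensional.

```python
import math


def log_bayes_factor(data, tau=1.0, log_weight_ratio=0.0):
    """log BF_n for model 2 = {N(0,1)} against model 1 = {N(theta,1)}, theta ~ N(0, tau^2).

    `log_weight_ratio` is log(lambda_{n,2} / lambda_{n,1}).
    """
    n = len(data)
    s = sum(data)
    v = 1.0 + n * tau ** 2
    # Conjugate normal computation: log(m_1/m_2) = -0.5*log(v) + tau^2 * s^2 / (2*v),
    # and log BF_n = log_weight_ratio - log(m_1/m_2).
    return log_weight_ratio + 0.5 * math.log(v) - tau ** 2 * s ** 2 / (2.0 * v)


null_like = log_bayes_factor([0.0] * 99)  # sample mean 0, favours the small model
shifted = log_bayes_factor([1.0] * 99)    # sample mean 1, favours the big model
```

When the sample mean is of order $n^{-1/2}$ the log Bayes factor grows like $\frac12 \log n$, and
when the sample mean stays away from zero it decreases linearly in $n$, in line with the
dichotomy $\mathrm{BF}_n \to \infty$ versus $\mathrm{BF}_n \to 0$.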

4. Testing a Finite- versus an Infinite-dimensional Model 

Suppose that there are two models, with the bigger model $\mathcal{P}_{n,1}$ infinite
dimensional, and the alternative model a fixed parametric model $\mathcal{P}_{n,2} = \mathcal{P}_2 =
\{p_\theta: \theta \in \Theta\}$, for $\Theta \subset \mathbb{R}^d$, equipped with a fixed prior $\Pi_{n,2} = \Pi_2$. Assume that
$\lambda_{n,1} = \lambda_{n,2}$. We shall show that the Bayes factors are typically consistent in this
situation: $\mathrm{BF}_n \to \infty$ if $p_0 \in \mathcal{P}_2$, and $\mathrm{BF}_n \to 0$ if $p_0 \notin \mathcal{P}_2$.

If the prior $\Pi_2$ is smooth in the parameter and the parametrization $\theta \mapsto p_\theta$
is regular, then, for any $\theta_0 \in \Theta$ and $\epsilon \to 0$,

\[ \Pi_2\Bigl(\theta: P_{\theta_0} \log\frac{p_{\theta_0}}{p_\theta} \le \epsilon^2,\ P_{\theta_0}\Bigl(\log\frac{p_{\theta_0}}{p_\theta}\Bigr)^2 \le \epsilon^2\Bigr) \sim C_{\theta_0}\epsilon^d. \]

Therefore, if the true density $p_0$ is contained in $\mathcal{P}_2$, then $\Pi_{n,2}(B_{n,2}(\epsilon_{n,2})) \gtrsim
\epsilon_{n,2}^d$, which is greater than $\exp[-n\epsilon_{n,2}^2]$ for $\epsilon_{n,2}^2 = D\log n/n$, for $D \ge d/2$ and
sufficiently large $n$. (The logarithmic factor enters, because we use the crude
prior mass condition (2.5) instead of the comparisons of prior mass in the main
theorems, but it does not matter for this example.)
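
The comparison between $\epsilon_{n,2}^d$ and $\exp[-n\epsilon_{n,2}^2]$ for $\epsilon_{n,2}^2 = D\log n/n$ can be checked
numerically. The sketch below ignores the multiplicative constant in the prior mass
bound (an assumption made to keep the example short); the function name is hypothetical.

```python
import math


def crude_prior_mass_sides(n, d, D):
    """Return (eps^d, exp(-n * eps^2)) for eps^2 = D * log(n) / n."""
    eps_sq = D * math.log(n) / n
    return eps_sq ** (d / 2.0), math.exp(-n * eps_sq)


# Hypothetical example: dimension d = 2 and D = d/2 = 1.
lower_bound_order, required = crude_prior_mass_sides(n=1000, d=2, D=1.0)
```

Here `required` equals $n^{-D}$, while `lower_bound_order` is $(D\log n/n)^{d/2}$, which is
larger by a power of $\log n$ when $D = d/2$ and by a power of $n$ when $D > d/2$.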

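For this choice of $\epsilon_{n,2}$ we have $\exp[n\epsilon_{n,2}^2] = n^D$. Therefore, it follows from
(2) of Corollary 3.1 that if $p_0 \in \mathcal{P}_2$, then the Bayes factor $\mathrm{BF}_n$ tends to $\infty$ as
$n \to \infty$ as soon as there exists $\epsilon_{n,1} > \epsilon_{n,2}$ such that

\[ \Pi_{n,1}\bigl(p: d(p, p_0) \le I\epsilon_{n,1}\bigr) = o(n^{-3D}). \tag{4.1} \]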

For an infinite-dimensional model $\mathcal{P}_{n,1}$ this is typically true, even if the models
are nested, when $p_0$ is also contained in $\mathcal{P}_{n,1}$. In fact, we typically have for
$p_0 \in \mathcal{P}_{n,1}$ that the left side is of the order $\exp[-Fn\epsilon_{n,1}^2]$ for $\epsilon_{n,1}$ the rate attached
to the model $\mathcal{P}_{n,1}$. As for a true infinite-dimensional model this rate is not
faster than $n^{-\alpha}$ for some $\alpha < 1/2$, this gives an upper bound of the order
$\exp(-Fn^{1-2\alpha})$, which is easily $o(n^{-3D})$. For $p_0$ not contained in the model
$\mathcal{P}_{n,1}$, the prior mass in the preceding display will be even smaller than this.

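If $p_0$ is not contained in the parametric model, then typically $d(p_0, \mathcal{P}_2) > 0$
and hence $d(p_0, \mathcal{P}_2) \ge I_n\epsilon_{n,1}$ for any $\epsilon_{n,1} \to 0$ and sufficiently slowly increasing
$I_n$, as required in (1) of Corollary 3.1. To ensure that $\mathrm{BF}_n \to 0$, it suffices that
for some $\epsilon_{n,1} > \epsilon_{n,2}$,

\[ \Pi_{n,1}\Bigl(p: P_0 \log\frac{p_0}{p} \le \epsilon_{n,1}^2,\ P_0\Bigl(\log\frac{p_0}{p}\Bigr)^2 \le \epsilon_{n,1}^2\Bigr) \ge e^{-n\epsilon_{n,1}^2}. \tag{4.2} \]

This is the usual prior mass condition (cf. Ghosal et al. (2000)) for obtaining the
rate of convergence $\epsilon_{n,1}$ using the prior $\Pi_{n,1}$ on the model $\mathcal{P}_{n,1}$.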

We present three concrete examples where the preceding can be made precise. 

Example 4.1 (Bernstein-Dirichlet mixtures). Bernstein polynomial densities
with a Dirichlet prior on the mixing distribution and a geometric or Poisson prior
on the order are described in Petrone (1999), Petrone and Wasserman (2002) and
Ghosal (2001). If we take these as the model $\mathcal{P}_{n,1}$ and prior $\Pi_{n,1}$, then the rate
$\epsilon_{n,1}$ is equal to $\epsilon_{n,1} = n^{-1/3}(\log n)^{5/6}$ and (4.2) is satisfied, as shown in
Ghosal (2001). As
the prior spreads its mass over an infinite-dimensional set that can approximate
any smooth function, condition (4.1) will be satisfied for most true densities $p_0$.
In particular, if $k_n$ is the minimal degree of a polynomial that is within Hellinger
distance $n^{-1/3}\log n$ of $p_0$, then the left side of (4.1) is bounded by the prior mass
of all Bernstein-Dirichlet polynomials of degree at least $k_n$, which is $e^{-ck_n}$ for
some constant $c$ by construction. Thus (4.1) is certainly satisfied if $k_n \gg \log n$.
Consequently, the Bayes factor is consistent for true densities that are not well
approximable by polynomials.

Example 4.2 (Log spline densities). Let $\mathcal{P}_{n,1}$ be equal to the set of log spline
densities described in Section 5 of dimension $J \asymp n^{1/(2\alpha+1)}$, equipped with the
prior obtained by putting the uniform distribution on $[-M, M]^J$ on the
coefficients. The corresponding rate can then be taken $\epsilon_{n,1} = n^{-\alpha/(2\alpha+1)}\sqrt{\log n}$ (see
Section 5.1). Conditions (4.1) and (4.2) can be verified easily by computations
on the uniform prior, after translating the distances on the spline densities into
the Euclidean distance on the coefficients (see Lemmas 7.4 and 7.6).



Example 4.3 (Infinite dimensional normal model). Let $\mathcal{P}_{n,1}$ be the set of
$N_\infty(\theta, I)$-distributions with $\theta = (\theta_1, \theta_2, \ldots)$ satisfying $\sum_{i=1}^\infty i^{2q}\theta_i^2 < \infty$. (Thus a
typical observation is an infinite sequence of independent normal variables with
means $\theta_i$ and variances 1.) Equip it with the prior obtained by letting the $\theta_i$ be
independent Gaussian variables with mean 0 and variances $i^{-(2q+1)}$. Take $\mathcal{P}_{n,2}$
equal to the submodel indexed by all $\theta$ with $\theta_i = 0$ for all $i \ge 2$, equipped with
a positive smooth (for instance Gaussian) prior on $\theta_1$. This model is equivalent
to the signal plus Gaussian white noise model, for which Bayesian procedures
were studied in Freedman (1999), Zhao (2000), Belitser and Ghosal (2003) and
Ghosal and van der Vaart (2007a).

The Kullback-Leibler and squared Hellinger distances on $\mathcal{P}_{n,1}$ are, up to
constants, essentially equivalent to the squared $\ell_2$-distance on the parameter $\theta$,
when the distance is bounded (see e.g. Lemma 6.1 of Belitser and Ghosal (2003)).
This allows one to verify (4.1) by calculations on Gaussian variables, after truncating
the parameter set. For a sufficiently large constant $M$, consider sieves $\mathcal{P}'_{n,1} =
\{\theta: \sum_{i=1}^\infty i^{2q'}\theta_i^2 \le M\}$ for some $q' < q$, and $\mathcal{P}'_{n,2} = \{|\theta_1| \le M\}$, respectively. The
posterior probabilities of the complements of these sieves are small in probability
respectively by Lemma 3.2 of Belitser and Ghosal (2003) and Lemma 7.2 of
Ghosal et al. (2000). Hence in view of Remark 2.1 it suffices to perform the
calculations on $\mathcal{P}'_{n,1}$ and $\mathcal{P}'_{n,2}$.

By Lemmas 6.3 and 6.4 of Belitser and Ghosal (2003) it follows that the
conditions of (2) of Corollary 3.1 hold for $\epsilon_{n,1} = \max(n^{-q'/(2q'+1)}, n^{-q/(2q+1)}) =
n^{-q'/(2q'+1)}$, provided that (4.1) can be verified. Now, for any $\theta_0 \in \mathcal{P}_2$,

\[ \Pi_{n,1}\Bigl(\theta: \sum_{i=1}^\infty (\theta_i - \theta_{0,i})^2 \le \epsilon_{n,1}^2\Bigr) \le \prod_{i=2}^\infty \Pi_{n,1}\bigl(|\theta_i| \le \epsilon_{n,1}\bigr) = \prod_{i=2}^\infty \bigl(2\Phi(i^{q+1/2}\epsilon_{n,1}) - 1\bigr). \]

For $i \le n^{2q'/((2q+1)(2q'+1))}$ the argument in the normal distribution function is
bounded above by 1, and then the corresponding factor is bounded by $2\Phi(1) -
1 < 1$. It follows that the right side of the last display is bounded by a term of
the order $e^{-cn^{2q'/((2q+1)(2q'+1))}}$ for some positive constant $c$. This easily shows
that (4.1) is satisfied.



5. Log Spline Models 



Log spline density models, introduced in Stone (1990), are exponential families
constructed as follows.

For a given "resolution" $K \in \mathbb{N}$ partition the half open unit interval $[0, 1)$
into $K$ subintervals $[(k-1)/K, k/K)$ for $k = 1, \ldots, K$. The linear space of
splines of "order" $q \in \mathbb{N}$ relative to this partition is the set of all continuous
functions $f: [0, 1] \to \mathbb{R}$ that are $q - 2$ times differentiable on $[0, 1)$ and whose
restriction to every of the partitioning intervals $[(k-1)/K, k/K)$ is a polynomial
of degree strictly less than $q$. It can be shown that these splines form a $J =
q + K - 1$-dimensional vector space. A convenient basis is the set of B-splines
$B_{J,1}, \ldots, B_{J,J}$, defined e.g. in de Boor (2001). The exact nature of these functions
does not matter to us here, but the following properties are essential (cf. de Boor
(2001), pp 109-110):

• $B_{J,j} \ge 0$, $j = 1, \ldots, J$

• $\sum_{j=1}^J B_{J,j} \equiv 1$

• $B_{J,j}$ is supported on an interval of length $q/K$

• at most $q$ functions $B_{J,j}$ are nonzero at every given $x$.

The first two properties express that the basis elements form a partition of unity,
and the third and fourth properties mean that their supports are close to being
disjoint if $K$ is very large relative to $q$. This renders the B-spline basis stable for
numerical computation, and also explains the simple inequalities between norms
of linear combinations of the basis functions and norms of the coefficients given
in Lemma 7.2 below.

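For $\theta \in \mathbb{R}^J$ let $\theta^T B_J = \sum_j \theta_j B_{J,j}$ and define

\[ p_{J,\theta}(x) = e^{\theta^T B_J(x) - c_J(\theta)}, \qquad c_J(\theta) = \log \int_0^1 e^{\theta^T B_J(x)}\, dx. \]

Thus $p_{J,\theta}$ is a probability density that belongs to a $J$-dimensional exponential
family with sufficient statistics the B-spline functions. Since the B-splines add
up to unity, the family is actually of dimension $J - 1$ and we can restrict $\theta$ to
the subset of $\theta \in \mathbb{R}^J$ such that $\theta^T 1 = 0$.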
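
The construction can be sketched numerically as follows. The code evaluates the B-spline
basis by the standard Cox-de Boor recursion on the uniform partition with repeated
boundary knots, and normalises the log spline density by a midpoint rule. The function
names, the quadrature and the particular coefficient vector are assumptions made for
illustration; they are not taken from the paper or from de Boor (2001).

```python
import math


def bspline_basis(x, K, q):
    """Values at x in [0, 1) of the J = q + K - 1 B-splines of order q on K equal intervals."""
    knots = [0.0] * q + [k / K for k in range(1, K)] + [1.0] * q
    # Order-1 splines: indicators of the knot spans [t_i, t_{i+1}).
    vals = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
            for i in range(len(knots) - 1)]
    # Cox-de Boor recursion, raising the order one step at a time (0/0 read as 0).
    for order in range(2, q + 1):
        new_vals = []
        for i in range(len(knots) - order):
            left_den = knots[i + order - 1] - knots[i]
            right_den = knots[i + order] - knots[i + 1]
            left = (x - knots[i]) / left_den * vals[i] if left_den > 0 else 0.0
            right = ((knots[i + order] - x) / right_den * vals[i + 1]
                     if right_den > 0 else 0.0)
            new_vals.append(left + right)
        vals = new_vals
    return vals


def log_spline_density(theta, K, q, grid=2000):
    """Return x -> exp(theta^T B_J(x)) divided by a midpoint-rule estimate of its integral."""
    def unnormalised(x):
        return math.exp(sum(t * b for t, b in zip(theta, bspline_basis(x, K, q))))

    norm = sum(unnormalised((i + 0.5) / grid) for i in range(grid)) / grid
    return lambda x: unnormalised(x) / norm


K, q = 4, 3                                   # J = q + K - 1 = 6 basis functions
basis = bspline_basis(0.3, K, q)
theta = [0.5, -1.0, 0.2, 0.0, 0.7, -0.4]      # hypothetical coefficient vector
p = log_spline_density(theta, K, q)
p_shifted = log_spline_density([t + 2.0 for t in theta], K, q)
```

Adding a constant to every coordinate of $\theta$ leaves the density unchanged, because the
basis sums to one; this is the redundancy that the restriction $\theta^T 1 = 0$ removes.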

Splines possess excellent approximation properties for smooth functions, where
the error is smaller if the function is smoother or the dimension of the spline
space is higher. More precisely, a function $f \in C^\alpha[0, 1]$ can be approximated
with an error of order $(1/J)^\alpha$ by splines of order $q \ge \alpha$ and dimension $J$.
Because there are $J - 1$ free base coefficients, the variance of a best estimate in a
$J$-dimensional spline space can be expected to be of order $J/n$. Therefore, we
may expect to determine an optimal dimension $J$ for a given smoothness level
$\alpha$ from the bias-variance trade-off $J/n \asymp (1/J)^{2\alpha}$. This leads to the dimension
$J_{n,\alpha} \asymp n^{1/(2\alpha+1)}$, and the "usual" rate of convergence $n^{-\alpha/(2\alpha+1)}$.
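
The informal trade-off can be tabulated directly. In the sketch below the function names
and the sample values of $n$ and $\alpha$ are assumptions made for illustration; the dimension is
taken as the integer part of $n^{1/(2\alpha+1)}$.

```python
import math


def spline_dimension(n, alpha):
    """Integer part of n^(1/(2*alpha+1)), balancing J/n against J^(-2*alpha)."""
    return math.floor(n ** (1.0 / (2.0 * alpha + 1.0)))


def usual_rate(n, alpha):
    """The rate n^(-alpha/(2*alpha+1)) attached to smoothness level alpha."""
    return n ** (-alpha / (2.0 * alpha + 1.0))


n, alpha = 5000, 1.0
J = spline_dimension(n, alpha)
variance, squared_bias = J / n, J ** (-2.0 * alpha)
rate = usual_rate(n, alpha)
```

For these values the variance term $J/n$ and the squared bias term $J^{-2\alpha}$ are of the same
order, and the squared rate is of that order as well.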

This informal calculation was justified for maximum likelihood and Bayesian
estimators in Stone (1990) and Ghosal et al. (2000), respectively. The paper
Stone (1990) showed that the maximum likelihood estimator of $p$ in the model
$\{p_{J,\theta}: J = J_{n,\alpha}, \theta \in \mathbb{R}^{J_{n,\alpha}}, \theta^T 1 = 0\}$ achieves the rate of convergence $n^{-\alpha/(2\alpha+1)}$
if the true density $p_0$ belongs to $C^\alpha[0, 1]$. The paper Ghosal et al. (2000) showed
that a Bayes procedure with deterministic dimension $J_{n,\alpha}$ and a smooth prior
on the coefficients $\theta \in \mathbb{R}^{J_{n,\alpha}}$ achieves the same (posterior) rate. (In both papers
it is assumed that the true density is also bounded away from zero.)

Both the maximum likelihood estimator and the Bayesian estimator described 
previously depend on a. They can be made rate-adaptive to a by a variety of 
means. We shall consider several Bayesian schemes, based on different choices 
of priors $\bar\Pi_{n,\alpha}$ on the coefficients and $\lambda_n$ on the dimensions $J_{n,\alpha}$ of the spline
spaces. Thus $\bar\Pi_{n,\alpha}$ will be a prior on $\mathbb{R}^{J_{n,\alpha}}$ for

\[ J_{n,\alpha} = \lfloor n^{1/(2\alpha+1)} \rfloor, \tag{5.1} \]

the prior $\Pi_{n,\alpha}$ on densities will be the distribution induced under the map
$\theta \mapsto p_{J,\theta}$, where $J = J_{n,\alpha}$, and $\lambda_n$ is a prior on the regularity parameter $\alpha$.

We always choose the order of the splines involved in the construction of the
$\alpha$th log spline model at least $\alpha$.

We shall assume that the true density $p_0$ is bounded away from zero next
to being smooth, so that the Hellinger and $L_2$-metrics are equivalent. We shall
in fact assume that uniform upper and lower bounds are known, and construct
the priors on sets of densities that are bounded away from zero and infinity. It
follows from Lemma 7.3 that the latter is equivalent to restricting the coefficient
vector $\theta$ in $p_{J,\theta}$ to a rectangle $[-M, M]^J$ for some $M$. We shall construct our
priors on this rectangle, and assume that the true density $p_0$ is within the range
of the corresponding spline densities, i.e. $\|\log p_0\|_\infty \le C_4 M$ for the constant $C_4$
of Lemma 7.3. Extension to unknown $M$ through a second shell of adaptation
is possible (see e.g. Lember and van der Vaart (2007)), but will not be pursued
here.

In the next three sections we discuss three examples of priors. In the first
example we combine smooth priors $\bar\Pi_{n,\alpha}$ on the coefficients with fixed model
weights $\lambda_{n,\alpha} = \lambda_\alpha$. These natural priors lead to adaptation up to a logarithmic
factor. Even though we only prove an upper bound, we believe that the
logarithmic factor is not a defect of our proof, but connected to this prior. In the second
example we show how the logarithmic factor can be removed by using special
model weights $\lambda_{n,\alpha}$ that put less mass on small models. In the third example we
show that the logarithmic factor can also be removed by using discrete priors
$\bar\Pi_{n,\alpha}$ for the coefficients combined with model weights that put more mass on
small models. One might conclude from these examples that the fine details of
rates depend on a delicate balance between model priors and model weights.
The good news is that all three priors work reasonably well.

5.1. Flat priors 

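In this section we consider prior distributions $\bar\Pi_{n,\alpha}$ that possess Lebesgue densities
on $\{\theta \in \mathbb{R}^{J_{n,\alpha}}: \theta^T 1 = 0\}$ that vanish outside a big block $[-M, M]^{J_{n,\alpha}}$ and are
bounded above and below by $D^{J_{n,\alpha}}$ and $d^{J_{n,\alpha}}$ respectively, for given constants
$0 < d < D < \infty$. We combine these with fixed prior weights $\lambda_{n,\alpha} = \lambda_\alpha > 0$ on
the regularity parameter, which we restrict to $A = \{\alpha \in \mathbb{Q}^+: \alpha \ge \underline\alpha\}$ for some
(known) constant $\underline\alpha > 0$ and assume to satisfy $\sum_{\alpha \in A} \sqrt{\lambda_\alpha} < \infty$.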

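Theorem 5.1. If $p_0 \in C^\beta[0,1]$ for some $\beta \in \mathbb{Q}^+ \cap [\underline\alpha, \infty)$ and $\|\log p_0\|_\infty < \underline{C}_4 M$,
then there exists a constant $B$ such that $P_0^n \Pi_n(p: \|p - p_0\|_2 > B\epsilon_{n,\beta} | X_1, \ldots, X_n) \to 0$,
for $\epsilon_{n,\beta} = n^{-\beta/(2\beta+1)}\sqrt{\log n}$.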

Proof. The dimension numbers $J_{n,\alpha}$ defined in (5.1) relate to the present rates
$\epsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)}\sqrt{\log n}$ as $J_{n,\alpha} \log n \sim n\epsilon_{n,\alpha}^2$.

By Lemma 7.7 condition (2.1) is satisfied for any $\epsilon_{n,\alpha}$ such that $n\epsilon_{n,\alpha}^2 \ge J_{n,\alpha}$,
and hence certainly for the present $\epsilon_{n,\alpha}$. The constants $E_\alpha$ do not depend on $\alpha$,
and hence both $\bar E$ and $\underline E$ in Theorem 2.1 can be taken equal to a single constant
$E$.

Because $\|\log p_0\|_\infty < \underline{C}_4 M$ by assumption, the Hellinger distance of $p_0$ to
$\mathcal{P}_J$ is bounded above by a multiple of $J^{-\beta}$ by Lemma 7.8. By Lemma 7.6, for
$\epsilon_{n,\beta} \gtrsim J_{n,\beta}^{-\beta}$, some constants $\underline A$ and $\bar A$, and sufficiently large $n$,



n 



n,f3 



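Because $\theta_J \in \Theta_J$ by its definition, the set $\{\theta \in \Theta_J: \|\theta - \theta_J\|_2 \le \epsilon\}$ contains at
least a fraction $2^{-J}$ of the volume of the ball of radius $\epsilon$ around $\theta_J$, even though
it does not contain the full ball if $\theta_J$ is near the boundary of $\Theta_J$. It follows that,
for any $\alpha, \beta \in A$ and $\epsilon$, and $v_J$ the volume of the $J$-dimensional Euclidean ball,

$$\frac{\Pi_{n,\alpha}(C_{n,\alpha}(i\epsilon))}{\Pi_{n,\beta}(B_{n,\beta}(\epsilon_{n,\beta}))} \le \frac{(\bar a i \epsilon)^{J_{n,\alpha}}}{(\underline a \epsilon_{n,\beta})^{J_{n,\beta}}} \qquad (5.2)$$

for suitable constants $\underline a$ and $\bar a$, in view of Lemma 7.9.
If $\alpha < \beta$, then with $\epsilon = \epsilon_{n,\alpha}$ inequality (5.2) yields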



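$$\frac{\Pi_{n,\alpha}(C_{n,\alpha}(i\epsilon_{n,\alpha}))}{\Pi_{n,\beta}(B_{n,\beta}(\epsilon_{n,\beta}))}
\le \frac{(\bar a i \epsilon_{n,\alpha})^{J_{n,\alpha}}}{(\underline a \epsilon_{n,\beta})^{J_{n,\beta}}}
= \exp\Bigl[J_{n,\alpha}\Bigl(\log(\bar a i \epsilon_{n,\alpha}) - \frac{J_{n,\beta}}{J_{n,\alpha}} \log(\underline a \epsilon_{n,\beta})\Bigr)\Bigr]$$
$$\le \exp\Bigl[J_{n,\alpha}\Bigl(\log(\bar a i) + \frac{1}{H}|\log \underline a| + \log \epsilon_{n,\alpha} - \frac{1}{H} \log \epsilon_{n,\beta}\Bigr)\Bigr],$$

because $J_{n,\alpha} > H J_{n,\beta}$ for $\alpha < \beta$. Here

$$\log \epsilon_{n,\alpha} - \frac{1}{H} \log \epsilon_{n,\beta}
= \Bigl(\frac{1}{H}\frac{\beta}{2\beta+1} - \frac{\alpha}{2\alpha+1}\Bigr) \log n + \frac{1}{2}\Bigl(1 - \frac{1}{H}\Bigr) \log\log n.$$

For sufficiently large $H$ the coefficient of $\log n$ is negative, uniformly in $\alpha \ge \underline\alpha$.
Condition (2.2) is easily satisfied for such $H$, with $\mu_{n,\alpha} = \lambda_\alpha/\lambda_\beta$ and arbitrarily
small $L > 0$.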
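
The sign claim can be checked numerically. The sketch below assumes that the coefficient of $\log n$ is $\beta/(H(2\beta+1)) - \alpha/(2\alpha+1)$, which is what the rates $\epsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)}\sqrt{\log n}$ give. The threshold $H \ge (2\underline\alpha+1)/(2\underline\alpha)$ used here is an elementary sufficient bound of our own (it follows from $\beta/(2\beta+1) < 1/2$), not a constant stated in the text, and the function names are hypothetical.

```python
def log_n_coefficient(alpha: float, beta: float, H: float) -> float:
    """Coefficient of log n in log(eps_{n,alpha}) - (1/H) * log(eps_{n,beta})."""
    return beta / (H * (2.0 * beta + 1.0)) - alpha / (2.0 * alpha + 1.0)


def sufficient_H(alpha_lower: float) -> float:
    """An H that makes the coefficient negative for all alpha >= alpha_lower and all beta.

    Uses beta/(2*beta+1) < 1/2 and the fact that alpha/(2*alpha+1) increases in alpha.
    """
    return (2.0 * alpha_lower + 1.0) / (2.0 * alpha_lower)


if __name__ == "__main__":
    alpha_lower = 0.25
    H = sufficient_H(alpha_lower)
    for alpha in (0.25, 0.5, 1.0, 3.0):
        for beta in (0.25, 1.0, 10.0, 1000.0):
            print(alpha, beta, log_n_coefficient(alpha, beta, H))
```

The bound is uniform in $\beta$ because $\beta/(2\beta+1)$ never reaches $1/2$; only the lower end $\underline\alpha$ of the index set matters for the choice of $H$.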

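If $\alpha < \beta$, then with $i = I$ and $\epsilon = B\epsilon_{n,\alpha}$, inequality (5.2) and similar calculations
yield

$$e^{2n\epsilon_{n,\beta}^2}\, \frac{\Pi_{n,\alpha}(C_{n,\alpha}(IB\epsilon_{n,\alpha}))}{\Pi_{n,\beta}(B_{n,\beta}(\epsilon_{n,\beta}))}
\le \exp\Bigl[J_{n,\alpha}\Bigl(\log(\bar a IB) + \log \epsilon_{n,\alpha} - \frac{J_{n,\beta}}{J_{n,\alpha}} \log(\underline a \epsilon_{n,\beta}) + 2\frac{J_{n,\beta}}{J_{n,\alpha}} \log n\Bigr)\Bigr]$$
$$\le \exp\Bigl[J_{n,\alpha}\Bigl(\log(\bar a IB) + \frac{1}{H}|\log \underline a| + \log \epsilon_{n,\alpha} - \frac{1}{H} \log \epsilon_{n,\beta} + \frac{2}{H} \log n\Bigr)\Bigr].$$

By the same arguments as before for sufficiently large $H$ the exponent is smaller
than $-J_{n,\alpha}\, c \log n$ for a positive constant $c$, uniformly in $\alpha \ge \underline\alpha$, eventually. This
implies that (2.4) is fulfilled.

With $\epsilon = \epsilon_{n,\beta}$, inequality (5.2) yields

$$\frac{\Pi_{n,\alpha}(C_{n,\alpha}(i\epsilon_{n,\beta}))}{\Pi_{n,\beta}(B_{n,\beta}(\epsilon_{n,\beta}))}
\le \frac{(\bar a i \epsilon_{n,\beta})^{J_{n,\alpha}}}{(\underline a \epsilon_{n,\beta})^{J_{n,\beta}}}
= \exp\Bigl[J_{n,\beta}\Bigl(\frac{J_{n,\alpha}}{J_{n,\beta}}\bigl(\log(\bar a i) + \log \epsilon_{n,\beta}\bigr) - \log(\underline a \epsilon_{n,\beta})\Bigr)\Bigr].$$

If $\alpha \ge \beta$ the right side is bounded above by

$$\exp\bigl[J_{n,\beta}\bigl(H|\log(\bar a i)| - \log(\underline a \epsilon_{n,\beta})\bigr)\bigr] \le \exp\bigl[J_{n,\beta} (\log n) L i^2\bigr],$$

for sufficiently large $i$, where $L$ may be an arbitrarily small constant. Hence
condition (2.3) is fulfilled.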

The theorem is a consequence of Theorem 2.1 with $A_n$ of this theorem equal
to the present $A$. ■

5.2. Flat priors, decreasing weights 

The constant weights $\lambda_{n,\alpha} = \lambda_\alpha$ used in the preceding subsection resulted in an
additional logarithmic factor in the rate. The following theorem shows that for
$A = \{\alpha_1, \alpha_2, \ldots, \alpha_N\}$ a finite set, this factor can be removed by choosing the
weights

$$\lambda_{n,\alpha} \propto \prod_{\gamma \in A:\, \gamma < \alpha} (C\epsilon_{n,\gamma})^{J_{n,\gamma}}. \qquad (5.3)$$

These weights are decreasing in $\alpha$, unlike the weights in (2.8). Thus the present
prior puts less weight on the smaller, more regular models. We use the same
priors $\bar\Pi_{n,\alpha}$ on the spline models as in Section 5.1.
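
A minimal sketch of how such weights could be computed on a finite grid follows. It assumes that (5.3) has the form $\lambda_{n,\alpha} \propto \prod_{\gamma \in A: \gamma < \alpha} (C\epsilon_{n,\gamma})^{J_{n,\gamma}}$ with $\epsilon_{n,\gamma} = n^{-\gamma/(2\gamma+1)}$ and $J_{n,\gamma}$ as in (5.1). The products are extremely small, so the code works with logarithms; the function names and the example grid are hypothetical.

```python
import math


def log_unnormalised_weights(alphas, n, C):
    """Return {alpha: log of prod_{gamma < alpha} (C * eps_gamma) ** J_gamma}."""
    log_w = {}
    running = 0.0  # empty product for the smallest alpha
    for a in sorted(alphas):
        log_w[a] = running
        J = math.floor(n ** (1.0 / (2.0 * a + 1.0)))
        eps = n ** (-a / (2.0 * a + 1.0))
        running += J * math.log(C * eps)  # negative whenever C * eps < 1
    return log_w


def normalise(log_w):
    """Turn log-weights into probabilities using a log-sum-exp shift."""
    m = max(log_w.values())
    total = sum(math.exp(v - m) for v in log_w.values())
    return {a: math.exp(v - m) / total for a, v in log_w.items()}


if __name__ == "__main__":
    log_w = log_unnormalised_weights([0.5, 1.0, 2.0], n=100, C=1.0)
    for a, p in normalise(log_w).items():
        print(f"alpha={a}: log-weight={log_w[a]:.2f}, weight={p:.3e}")
```

Even for moderate $n$ almost all prior mass sits on the smallest $\alpha$, that is, on the largest model, which illustrates how strongly these weights penalise the small models.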

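Theorem 5.2. Let $A$ be a finite set. If $p_0 \in C^\beta[0,1]$ for some $\beta \in A$ and
$\|\log p_0\|_\infty < \underline{C}_4 M$, and sufficiently large $C$, then there exists a constant $B$ such
that $P_0^n \Pi_n(p: \|p - p_0\|_2 > B\epsilon_{n,\beta} | X_1, \ldots, X_n) \to 0$ for $\epsilon_{n,\beta} = n^{-\beta/(2\beta+1)}$.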

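Proof. Let $\epsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)}$, so that $J_{n,\alpha} \sim n\epsilon_{n,\alpha}^2$, and $J_{n,\alpha'}/J_{n,\alpha} \ll n^{-c}$
for some $c > 0$ whenever $\alpha' > \alpha$. Assume without loss of generality that $A =
\{\alpha_1, \ldots, \alpha_N\}$ is indexed in its natural order.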

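If $r < s$, and hence $\alpha_r < \alpha_s$, then, by inequality (5.2),

$$\frac{\lambda_{n,\alpha_r} \Pi_{n,\alpha_r}(C_{n,\alpha_r}(i\epsilon_{n,\alpha_r}))}{\lambda_{n,\alpha_s} \Pi_{n,\alpha_s}(B_{n,\alpha_s}(\epsilon_{n,\alpha_s}))}
\le \exp\Bigl[J_{n,\alpha_r}\Bigl(\log(\bar a i \epsilon_{n,\alpha_r}) - \frac{J_{n,\alpha_s}}{J_{n,\alpha_r}} \log(\underline a \epsilon_{n,\alpha_s}) - \sum_{k=r}^{s-1} \frac{J_{n,\alpha_k}}{J_{n,\alpha_r}} \log(C\epsilon_{n,\alpha_k})\Bigr)\Bigr].$$

The exponent takes the form $J_{n,\alpha_r}(\log(\bar a i/C) + o(1))$. Applying this with $\alpha_r =
\alpha < \beta = \alpha_s$, we conclude that (2.2) holds for every $C$ and $\mu_{n,\alpha} = 1$, for
sufficiently large $i$, for an arbitrarily small constant $L$, eventually.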
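Similarly, again with $\alpha_r = \alpha < \beta = \alpha_s$,

$$e^{2n\epsilon_{n,\alpha_s}^2}\, \frac{\lambda_{n,\alpha_r} \Pi_{n,\alpha_r}(C_{n,\alpha_r}(IB\epsilon_{n,\alpha_r}))}{\lambda_{n,\alpha_s} \Pi_{n,\alpha_s}(B_{n,\alpha_s}(\epsilon_{n,\alpha_s}))}
\le \exp\Bigl[J_{n,\alpha_r}\Bigl(\log\Bigl(\frac{\bar a IB}{C}\Bigr) - \frac{J_{n,\alpha_s}}{J_{n,\alpha_r}} \log(\underline a \epsilon_{n,\alpha_s}) + 2\frac{J_{n,\alpha_s}}{J_{n,\alpha_r}} - \sum_{k=r+1}^{s-1} \frac{J_{n,\alpha_k}}{J_{n,\alpha_r}} \log(C\epsilon_{n,\alpha_k})\Bigr)\Bigr].$$

This tends to 0 if $C > \bar a IB$. Hence, for $C$ big enough, condition (2.4) is fulfilled
as well.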

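Finally, choose $\alpha_r = \beta < \alpha = \alpha_s$ and note that

$$\frac{\lambda_{n,\alpha_s} \Pi_{n,\alpha_s}(C_{n,\alpha_s}(i\epsilon_{n,\alpha_r}))}{\lambda_{n,\alpha_r} \Pi_{n,\alpha_r}(B_{n,\alpha_r}(\epsilon_{n,\alpha_r}))}
\le \exp\Bigl[J_{n,\alpha_r}\Bigl(\frac{J_{n,\alpha_s}}{J_{n,\alpha_r}} \log(\bar a i \epsilon_{n,\alpha_r}) - \log(\underline a \epsilon_{n,\alpha_r}) + \sum_{k=r}^{s-1} \frac{J_{n,\alpha_k}}{J_{n,\alpha_r}} \log(C\epsilon_{n,\alpha_k})\Bigr)\Bigr]$$
$$= \exp\Bigl[J_{n,\alpha_r}\Bigl(\frac{J_{n,\alpha_s}}{J_{n,\alpha_r}} \log(\bar a i \epsilon_{n,\alpha_r}) + \log\Bigl(\frac{C}{\underline a}\Bigr) + \sum_{k=r+1}^{s-1} \frac{J_{n,\alpha_k}}{J_{n,\alpha_r}} \log(C\epsilon_{n,\alpha_k})\Bigr)\Bigr].$$

Here the exponent is of the order $J_{n,\alpha_r}(\log(C/\underline a) + o(1) \log i + o(1))$. We conclude
that the condition (2.3) holds.

The theorem is a consequence of Theorem 2.1 with $A_n$ of this theorem equal
to the present $A$. ■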



The proof of Theorem 5.2 relies on the fact that $J_{n,\alpha_r} \ll J_{n,\alpha_s}$ if $s < r$, and
for that reason it does not extend to the more general case of a dense set $A$ as
considered in Theorem 5.1. On the other hand, it would be possible to extend
the theorem to a countable totally ordered set $\alpha_1 < \alpha_2 < \cdots$ by using the
weights (5.3) restricted to sets $\alpha_1 < \alpha_2 < \cdots < \alpha_{M_n}$ for $M_n \uparrow \infty$.

The existence of rate-adaptive priors that yield the rate without log-factor
in the general countable case is a subject for further research. There are some
reasons to believe that this task is achievable with somewhat more elaborate priors
than those in (5.3). For example, one could consider more general priors than (5.3)
of the type

The truncation-set An as well as the constants C-y^a and A^ ^ must be carefully 
chosen.

5.3. Discrete priors, increasing weights

In this section we choose the priors $\bar\Pi_{n,\alpha}$ to be discrete on a suitable subset of
the $J_{n,\alpha}$-dimensional log spline densities, constructed as follows.

According to Kolmogorov and Tihomirov [1961] (cf. Theorem 2.7.1 in
van der Vaart and Wellner [1996]) the unit ball $C_1^\alpha[0,1]$ of the Hölder space
$C^\alpha[0,1]$ has entropy $\log N(\epsilon, C_1^\alpha[0,1], \|\cdot\|_\infty)$ of the order $(1/\epsilon)^{1/\alpha}$ as $\epsilon \downarrow 0$,
relative to the uniform norm. Then it follows that there exists a set of $N_{n,\alpha}$
functions $f_1, \ldots, f_{N_{n,\alpha}}$, with $\log N_{n,\alpha} \lesssim (M/\epsilon_{n,\alpha})^{1/\alpha}$, such that every $f$ with Hölder norm smaller
than a given constant $M$ is within uniform distance $\epsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)}$ of
some $f_i$. These functions can without loss of generality be chosen with Hölder
norm bounded by $M$. By the approximation properties of spline spaces (cf.
Lemma 7.1), we can find $\theta_i \in \mathbb{R}^{J_{n,\alpha}}$ such that $\|\theta_i^T B_{J_{n,\alpha}} - f_i\|_\infty \le C_{q,\alpha}(1/J_{n,\alpha})^\alpha$.
Define $\bar\Pi_{n,\alpha}$ to be the uniform probability distribution on the collection $\theta_1, \ldots, \theta_{N_{n,\alpha}}$.

We combine the resulting prior $\Pi_{n,\alpha}$ on log spline densities with model
weights $\lambda_{n,\alpha}$ on the index set $A = \mathbb{Q}^+$ of the form (2.8), where $(\mu_\alpha: \alpha \in \mathbb{Q}^+)$ is
a strictly positive measure with $\sum_{\alpha \in \mathbb{Q}^+} \sqrt{\mu_\alpha} < \infty$ and $C$ is an arbitrary positive
constant.

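Theorem 5.3. If $p_0 \in C^\beta[0,1]$ for some $\beta \in A$ and $\|\log p_0\|_\beta < M$, then there
exists a constant $B$ such that $P_0^n \Pi_n(p: h(p,p_0) > B\epsilon_{n,\beta} | X_1, \ldots, X_n) \to 0$ for
$\epsilon_{n,\beta} = n^{-\beta/(2\beta+1)}$.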

Proof. By construction there exists an element $\theta_i$ in the support of $\bar\Pi_{n,\beta}$ such
that

$$\|\log p_0 - \theta_i^T B_{J_{n,\beta}}\|_\infty \lesssim \epsilon_{n,\beta} + (1/J_{n,\beta})^\beta \lesssim \epsilon_{n,\beta}.$$

It follows that the function $e^{\theta_i^T B_{J_{n,\beta}}}$ is sandwiched between the functions $p_0 e^{-d\epsilon_{n,\beta}}$
and $p_0 e^{d\epsilon_{n,\beta}}$, for some constant $d$. Consequently, the norming constant satisfies
$|e^{c(\theta_i)} - 1| \lesssim \epsilon_{n,\beta}$, and hence $\|p_0 - p_{J_{n,\beta},\theta_i}\|_\infty \lesssim \epsilon_{n,\beta}$. Because $p_0$ is
bounded away from zero and infinity, this implies that $p_{J_{n,\beta},\theta_i}$ is in the Kullback-Leibler
neighbourhood $B_{n,\beta}(D\epsilon_{n,\beta})$ of $p_0$, for some constant $D$ (cf. Lemma 8
in Ghosal and van der Vaart [2007b]). Because $\Pi_{n,\beta}$ is the uniform measure on
the $N_{n,\beta}$ log spline densities of this type, it follows that $\Pi_{n,\beta}(B_{n,\beta}(D\epsilon_{n,\beta})) \ge
N_{n,\beta}^{-1} \ge \exp[-Fn\epsilon_{n,\beta}^2]$ for some positive constant $F$.
In view of ()2.8p it follows, for any e. 



Xn,(3 'nn,f3{Bn,l3{Den,f3)) ~ Xn,f3 fJ'/3 

Define the sets of indices $\alpha < \beta$ and $\alpha \ge \beta$ as in Theorem 2.1 relative to a given
constant $H$. Thus $\alpha < \beta$ is equivalent to $J_{n,\alpha} > HJ_{n,\beta}$ and hence the sum over
$\alpha < \beta$ of the preceding display can be bounded above by

The leading term is $o(e^{-2n\epsilon_{n,\beta}^2})$ provided $H$ is big enough, and the sum is
bounded by assumption. Thus (2.4) is fulfilled for any constant $B$. Furthermore,
condition (2.2) holds trivially with $\mu_{n,\alpha} = (\mu_\alpha/\mu_\beta)e^{-Cn\epsilon_{n,\alpha}^2}$. Condition
(2.3) clearly holds for sufficiently large $i$, with the same choice of $\mu_{n,\alpha}$ and any
$L > 0$.

The theorem is a consequence of Theorem 2.1 with $A_n$ of this theorem equal
to the present $A$. ■



6. Proof of the main theorems 



We start by extending results from Ghosal and van der Vaart [2007b],
Ghosal et al. [2000], LeCam [1973] and Birgé [1983] on the existence
of certain tests under local entropy of a statistical model. The results differ
from the last three references by inclusion of weights $\alpha$ and $\beta$; relative to
Ghosal and van der Vaart [2007b] the difference is the use of local rather than
global entropy.

Let $d$ be a metric which induces convex balls and is bounded above on $\mathcal{P}$ by
the Hellinger metric $h$.

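Lemma 6.1. For any dominated convex set of probability measures $\mathcal{P}$, and any
constants $\alpha, \beta > 0$ and any $n$ there exists a test $\phi$ with

$$\sup_{Q \in \mathcal{P}} \bigl(\alpha P_0^n \phi + \beta Q^n(1 - \phi)\bigr) \le \sqrt{\alpha\beta}\, e^{-\frac{1}{2} n h^2(P_0, \mathcal{P})}.$$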
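
As a sanity check in the simplest possible case, where $\mathcal{P}$ consists of a single Bernoulli distribution $Q$ (trivially convex), the optimal weighted error $\int \min(\alpha p_0^n, \beta q^n)$ can be computed exactly and compared with $\sqrt{\alpha\beta}\, e^{-n h^2(P_0,Q)/2}$. The sketch below assumes the normalisation $h^2(P,Q) = \int (\sqrt p - \sqrt q)^2$, so that the affinity equals $1 - h^2/2$; the function names are ours and the example is only an illustration, not part of the proof.

```python
import math


def optimal_weighted_error(p0, q, n, a, b):
    """min over tests of a*P0^n(phi) + b*Q^n(1-phi) for Bernoulli(p0) vs Bernoulli(q).

    The minimum equals the sum of min(a*p0^n(x), b*q^n(x)) over all outcomes x,
    grouped here by the number k of successes.
    """
    total = 0.0
    for k in range(n + 1):
        lik0 = p0**k * (1 - p0) ** (n - k)
        lik1 = q**k * (1 - q) ** (n - k)
        total += math.comb(n, k) * min(a * lik0, b * lik1)
    return total


def hellinger_sq(p0, q):
    """Squared Hellinger distance between two Bernoulli laws (affinity = 1 - h^2/2)."""
    affinity = math.sqrt(p0 * q) + math.sqrt((1 - p0) * (1 - q))
    return 2.0 * (1.0 - affinity)


def lemma_bound(p0, q, n, a, b):
    """Right-hand side sqrt(a*b) * exp(-n*h^2/2) of the lemma."""
    return math.sqrt(a * b) * math.exp(-0.5 * n * hellinger_sq(p0, q))


if __name__ == "__main__":
    for n in (1, 10, 50):
        err = optimal_weighted_error(0.3, 0.6, n, a=2.0, b=0.5)
        print(n, err, lemma_bound(0.3, 0.6, n, a=2.0, b=0.5))
```

The comparison works because $\min(x, y) \le \sqrt{xy}$ termwise and the affinity of product measures is the $n$th power of the one-observation affinity.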



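Proof. This follows by minor adaptation of a result of Le Cam [1986]. The
essence is that, by the minimax theorem,

$$\inf_\phi \sup_{Q \in \mathcal{P}} \bigl(\alpha P_0^n \phi + \beta Q^n(1 - \phi)\bigr)
= \sup_{Q_n \in \mathrm{conv}(\mathcal{P}^n)} \inf_\phi \bigl(\alpha P_0^n \phi + \beta Q_n(1 - \phi)\bigr)$$
$$= \sup_{Q_n \in \mathrm{conv}(\mathcal{P}^n)} \bigl(\alpha P_0^n(\alpha p_0^n < \beta q_n) + \beta Q_n(\alpha p_0^n \ge \beta q_n)\bigr)
\le \sup_{Q_n \in \mathrm{conv}(\mathcal{P}^n)} \sqrt{\alpha\beta} \int \sqrt{p_0^n q_n}.$$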



Next we use the convexity of $\mathcal{P}$ to see that this is bounded above by (see Le Cam
[1986] or Lemma 6.2 in Kleijn and van der Vaart [2006])

$$\sqrt{\alpha\beta}\, \Bigl(\sup_{Q \in \mathcal{P}} \int \sqrt{p_0 q}\Bigr)^n.$$

Finally we express the affinity $\int \sqrt{p_0 q}$ in the Hellinger distance as $1 - \frac{1}{2}h^2(P_0, Q)$
and use the inequality $1 - x \le e^{-x}$, for $x > 0$. ■

Corollary 6.1. For any dominated set of probability measures $\mathcal{P}$ with $d(P_0, \mathcal{P}) >
3\epsilon$, any $\alpha, \beta > 0$ and any $n$ there exists a test $\phi$ with



Proof. Choose a maximal $2\epsilon$-separated set $\mathcal{P}'$ of points in $\mathcal{P}$. Then the balls
$B_{Q'}$ of radius $2\epsilon$ centered at the points in $\mathcal{P}'$ cover $\mathcal{P}$, whence their number
is bounded by $N(2\epsilon, \mathcal{P}, d)$. Furthermore, these balls are convex by assumption,
and are at distance $3\epsilon - \epsilon = 2\epsilon$ from $P_0$. The latter is true both for the distance
$d$ and the Hellinger distance, which is larger by assumption. For every ball $B_{Q'}$
attached to a point $Q' \in \mathcal{P}'$ there exists a test $\omega_{Q'}$ with the properties as in
Lemma 6.1 with $\mathcal{P}$ taken equal to $B_{Q'}$. Let $\phi$ be the maximum of all tests
attached in this way to some point $Q' \in \mathcal{P}'$. Then



The right sides can be further bounded as desired. ■ 

Lemma 6.2. Suppose that for a dominated set of probability measures $\mathcal{P}$, some
nonincreasing function $\epsilon \mapsto N(\epsilon)$, some $\epsilon_0 > 0$, and for every $\epsilon > \epsilon_0$,



Then for every $\epsilon > \epsilon_0$ and every $\alpha, \beta > 0$ there exists a test $\phi$ (depending on $\epsilon$
but not on $i$) such that for every $i \in \mathbb{N}$,



Proof. For $j \in \mathbb{N}$ let $\mathcal{P}_j = \{p \in \mathcal{P}: j\epsilon < d(p, p_0) \le (j+1)\epsilon\}$. Because the set
$\mathcal{P}_j$ has distance $3(j\epsilon/3)$ to $p_0$, the preceding corollary implies the existence of a
test $\phi_j$ with







sup 

$p \in \mathcal{P}: d(p, p_0) > i\epsilon$

We define $\phi$ as the supremum of all tests $\phi_j$ for $j \in \mathbb{N}$. The size of this test
is bounded by $\sqrt{\beta/\alpha}\, N(\epsilon) \sum_{j \in \mathbb{N}} \exp[-\ldots]$. The power is bigger than the
power of any of the tests $\phi_j$. ■

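Lemma 6.3 (Lemma 8.1 in Ghosal et al. [2000]). For every $\epsilon > 0$ and probability
measure $\Pi$ we have, for any $C > 0$ and $B(\epsilon) = \{p \in \mathcal{P}: P_0 \log(p_0/p) \le
\epsilon^2, P_0(\log(p_0/p))^2 \le \epsilon^2\}$,

$$P_0^n\Bigl(\int \prod_{i=1}^n \frac{p}{p_0}(X_i)\, d\Pi(p) \le \Pi(B(\epsilon))\, e^{-(1+C)n\epsilon^2}\Bigr) \le \frac{1}{C^2 n\epsilon^2}.$$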

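Proof of Theorem 2.1. Abbreviate $J_{n,\alpha} = n\epsilon_{n,\alpha}^2$, so that the constant $\bar E$ defined
in the theorem is given by $\bar E = \sup_{\alpha \ge \beta_n} E_\alpha J_{n,\alpha}/J_{n,\beta_n}$.

For $\alpha \ge \beta_n$ we have $B\epsilon_{n,\beta_n} \ge (B/\sqrt{H})\epsilon_{n,\alpha} \ge \epsilon_{n,\alpha}$. Therefore, in view of the
entropy bound (2.1) and Lemma 6.2 with $\epsilon = B\epsilon_{n,\beta_n}$ and $\log N(\epsilon) = E_\alpha J_{n,\alpha}$
(constant in $\epsilon$), there exists for every $\alpha \ge \beta_n$ a test $\phi_{n,\alpha}$ with,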

p(B„J„.„--R'bV„,0„) 

P"6 < /Tj < /Tj p(^-^s (6 1) 

^0 rn.Q ^ V l — Q-KB^J„i3 ^ V ""."'^ ' \'^-^) 

sup P"(l-(/'„,a) < e-^^'''-^-''". (6.2) 

p6T'n,Q:rf(p,Po)>iS£„.a„ V Mn,Q 



pn 
^0 



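For $\alpha < \beta_n$ we have that $\epsilon_{n,\alpha} > \sqrt{H}\epsilon_{n,\beta_n}$ and we cannot similarly test balls of
radius proportional to $\epsilon_{n,\beta_n}$ in $\mathcal{P}_{n,\alpha}$. However, Lemma 6.2 with $\epsilon = B'\epsilon_{n,\alpha}$ and
$B' := B/\sqrt{H} > 1$, still gives tests $\phi_{n,\alpha}$ such that for every $i \in \mathbb{N}$,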

i^n,a t^n,a ~ —KB'^ J ° ^ " ° '~ V Mri,C(6^ ^ (6.3) 

sup P"(l - (/.„,„) < ^^e-^^"^'-^"-. (6.4) 

pe7'„,c,:d(p,po)>i-B'eii,c« V^"'" 

Let $\phi_n = \sup_{\alpha \in A_n} \phi_{n,\alpha}$ be the supremum of all tests so constructed.

The test $\phi_n$ is more powerful than all the tests $\phi_{n,\alpha}$ and has error of the first
kind $P_0^n \phi_n$ bounded by

for $c = (KB^2 - \bar E) \wedge (KB^2 - \underline E H)$, which is bigger than 1 by assumption. Because
$J_{n,\beta_n} \to \infty$ and $\sum_{\alpha \in A_n} \sqrt{\mu_{n,\alpha}} \le \exp J_{n,\beta_n}$, this tends to zero. Consequently,
for any $IB$,

$$P_0^n \Pi_n(p: d(p, p_0) > IB\epsilon_{n,\beta_n} | X_1, \ldots, X_n)\phi_n \le P_0^n \phi_n \to 0. \qquad (6.5)$$

We shall complement this with an analysis of the posterior multiplied by $1 - \phi_n$.

By Lemma 6.3 there exist events $A_n$ with probability $P_0^n(A_n) \ge 1 -
(n\epsilon_{n,\beta_n}^2)^{-1} \to 1$ on which

$$\int \prod_{i=1}^n \frac{p}{p_0}(X_i)\, d\Pi_n(p) \ge e^{-2J_{n,\beta_n}} \lambda_{n,\beta_n} \Pi_{n,\beta_n}(B_{n,\beta_n}(\epsilon_{n,\beta_n})). \qquad (6.6)$$

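Define for $i \in \mathbb{N}$,

$$S_{n,\alpha,i} = \{p \in \mathcal{P}_{n,\alpha}: iB'\epsilon_{n,\alpha} < d(p, p_0) \le (i+1)B'\epsilon_{n,\alpha}\}, \qquad \alpha < \beta_n,$$
$$S_{n,\alpha,i} = \{p \in \mathcal{P}_{n,\alpha}: iB\epsilon_{n,\beta_n} < d(p, p_0) \le (i+1)B\epsilon_{n,\beta_n}\}, \qquad \alpha \ge \beta_n.$$

Then

$$\{p: d(p, p_0) > IB\epsilon_{n,\beta_n}\}
\subset \bigcup_\alpha \bigcup_{i \ge I} S_{n,\alpha,i} \cup \bigcup_{\alpha < \beta_n} \{p \in \mathcal{P}_{n,\alpha}: IB\epsilon_{n,\beta_n} < d(p, p_0) \le IB'\epsilon_{n,\alpha}\}$$
$$\subset \bigcup_\alpha \bigcup_{i \ge I} S_{n,\alpha,i} \cup \bigcup_{\alpha < \beta_n} C_{n,\alpha}(IB'\epsilon_{n,\alpha}).$$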

By Fubini's theorem and the inequality $P_0(p/p_0) \le 1$, we have for every set $C$,

$$P_0^n \int_C \prod_{i=1}^n \frac{p}{p_0}(X_i)(1 - \phi_n)\, d\Pi_{n,\alpha}(p) \le \sup_{p \in C} P^n(1 - \phi_n)\, \Pi_{n,\alpha}(C).$$

Combining these inequalities with (6.6), (6.2) and (6.4) we see that,
P^'Hn [d{p,po) > IBenji^ . . . , X„) (1 - 0„)U„ 

~ aeA~^>/3„ ^"''3" e-2J„,^„n„,;3„(B„,;3„(e„,0j) VM^i;^ 

X p-ifS'2l^J„,„TT (Q .\ 1 

{Cn,a{IB'en,a)) 

aeA~^c.<0„ ^"A. e-2^".'^n„,0„(B„,/3„(£„,^J)- 

The third term on the right tends to zero by assumption (2.4), since $B' =
B/\sqrt{H} \le B$. We shall show that the first two terms on the right also tend to
zero.

Because for $\alpha \ge \beta_n$ and for $i \ge I \ge 3$ we have $S_{n,\alpha,i} \subset C_{n,\alpha}(\sqrt{2}\, iB\epsilon_{n,\beta_n})$, the
assumption (2.3) shows that the first term is bounded by

e-^^''' n„,„(C„,„(\/2iS£„,0 J) 1 



y , 

aGA„:Q>/3„ i>/ 



< 

C(GA„:a>/3„ 



Because $\sum_{\alpha \in A_n} \sqrt{\mu_{n,\alpha}} \le \exp J_{n,\beta_n}$ by assumption, this tends to zero if $(K -
2L)I^2B^2 > 3$, which is assumed.

Similarly, for $\alpha < \beta_n$ the second term is bounded by, in view of (2.2),

qGA„:q</3„ i>I 

g(2L-A:)S'^J„,Q/^ 

- E VM"."^^"^"'"" _ g(2L-i<')B'2J„,„ • 
aeA„:Q</3„ 

Here $J_{n,\alpha} > HJ_{n,\beta_n}$ for every $\alpha < \beta_n$, and hence this tends to zero, because
again $(K - 2L)B^2I^2 > 3$. ■



Proof of Theorem 2.2. We follow the line of argument of the proof of Theorem 2.1,
the main difference being that presently $J_{n,\beta_n} = 1$ and hence does not
tend to infinity. To make sure that $P_0^n \phi_n$ is small we choose the constant $B$
sufficiently large, and to make $P_0^n(A_n)$ sufficiently large we apply Lemma 6.3 with
$C$ a large constant instead of $C = 1$. This gives a factor $e^{-(1+C)J_{n,\beta_n}}$ instead of
$e^{-2J_{n,\beta_n}}$ in the denominators of (6.7), but this is fixed for fixed $C$. The arguments
then show that for an event $A_n$ with probability arbitrarily close to 1 the
expectation $P_0^n \Pi_n(p: d(p, p_0) > IB\epsilon_{n,\beta_n} | X_1, \ldots, X_n) 1_{A_n}$ can be made arbitrarily
small by choosing sufficiently large $I$ and $B$. This proves the theorem. ■

Proof of Theorem 3.1. The second assertion of the theorem is an immediate
consequence of Theorems 2.1 and 2.2. These theorems show that the posterior
concentrates all its mass on balls of radius $BI\epsilon_{n,\beta_n}$ or $I_n\epsilon_{n,\beta_n}$ around $p_0$,
respectively. Hence the posterior cannot charge any model that does not intersect
these balls.

The first assertion can be proved using exactly the proof of Theorems 2.1
and 2.2, except that the references to $\alpha \ge \beta_n$ can be omitted. In the notation
of the proof of Theorem 2.1 we have that

$$\bigcup_{\alpha < \beta_n} \mathcal{P}_{n,\alpha} \subset \bigcup_{\alpha < \beta_n} \bigcup_{i \ge I} S_{n,\alpha,i} \cup \bigcup_{\alpha < \beta_n} C_{n,\alpha}(IB'\epsilon_{n,\alpha}).$$

It follows that $P_0^n \Pi_n(A_{n,<\beta_n} | X_1, \ldots, X_n)(1 - \phi_n)1_{A_n}$ can be bounded by the
sum of the second and third terms on the right side of (6.7), which tend to zero
under the conditions of Theorem 2.1 and can be made arbitrarily small under
the conditions of Theorem 2.2 by choosing $B$ and/or $I$ sufficiently large. ■



7. Technical proofs and complements 



In this section we list technical lemmas on approximation by spline spaces.
We let $\|f\|_2$ and $\|f\|_\infty$ be the $L_2[0,1]$ and the supremum norm of a function
$f: [0,1] \to \mathbb{R}$, and similarly write $\|\theta\|_2$ and $\|\theta\|_\infty$ for the Euclidean and maximum
norm of $\theta \in \mathbb{R}^J$. Let $\|\cdot\|_\alpha$ be a norm for $C^\alpha[0,1]$, for instance,



l/IU = ll/llc 



sup 



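Lemma 7.1. Let $q \ge \alpha > 0$. There exists a constant $C_{q,\alpha}$ depending only on $q$
and $\alpha$ such that, for every $f$ in $C^\alpha[0,1]$,

$$\inf_{\theta \in \mathbb{R}^J} \|\theta^T B_J - f\|_\infty \le C_{q,\alpha} J^{-\alpha} \|f\|_\alpha.$$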



Lemma 7.2. For any $\theta \in \mathbb{R}^J$,

PWoo < \\O^Bj\ 

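Lemma 7.3. For any $\theta \in \mathbb{R}^J$ such that $\theta^T 1 = 0$,

$$\underline{C}_4 \|\theta\|_\infty \le \|\log p_{J,\theta}\|_\infty \le \bar{C}_4 \|\theta\|_\infty.$$

Lemma 7.4. For every $\theta_1, \theta_2 \in \mathbb{R}^J$ such that $(\theta_1 - \theta_2)^T 1 = 0$,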



<VJ\\6'^Bj\\2< 



X.O.J 



J 



^ 1 ) ^ h^{pj,ei,PJ,e2) ^ sup pj^ 



x.,e,j 



^(^)(- 



J 



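where the infimum and supremum are taken over all $\theta$ on the line segment
between $\theta_1$ and $\theta_2$ and all $x \in [0,1]$.

Lemma 7.5. Let $q \ge \beta$. If $\log p_0 \in C^\beta[0,1]$, then the minimizer $\bar\theta_J$ of $\theta \mapsto
\|\log p_{J,\theta} - \log p_0\|_\infty$ over $\theta \in \mathbb{R}^J$ with $\theta^T 1 = 0$ satisfies

$$h(p_{J,\bar\theta_J}, p_0) \lesssim \|\log p_{J,\bar\theta_J} - \log p_0\|_\infty \lesssim J^{-\beta}.$$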

The first lemma in this list is the basic approximation lemma for splines
and shows that splines of sufficient dimension are well suited to approximating
smooth functions. Its proof can be found in de Boor [2001], p. 170. Lemmas 7.2 to
7.4 are (partly) implicit in Stone [1986, 1990] and can be explicitly found in
Ghosal et al. [2000]. The equivalence of the $L_2$-norm or infinity-norm on the
linear combinations of splines and the Euclidean or maximum norm on the
coefficients (up to constants) given by Lemma 7.2 is a consequence of using the
B-splines, with their special properties, as a basis. Lemma 7.5 is a consequence
of the other lemmas; a proof can be found in Ghosal et al. [2003].



For given $M > 0$ let $\Theta_J = \{\theta \in [-M, M]^J: \theta^T 1 = 0, \|\theta\|_\infty \le M\}$, and write
$\mathcal{P}_J$ for the set of functions $p_{J,\theta}$. By Lemma 7.3 the densities $p_{J,\theta}$ with $\theta \in \Theta_J$
take their values in the interval $[e^{-\bar{C}_4 M}, e^{\bar{C}_4 M}]$. In particular, they are uniformly
bounded away from zero and infinity. Assume that the true density $p_0$ is also
bounded away from 0 and infinity.

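In the present case the neighbourhoods $B_{n,\alpha}(\epsilon)$ and $C_{n,\alpha}(\epsilon)$ defined in (1.3)
take the forms $B_J(\epsilon)$ and $C_J(\epsilon)$ for $J = J_{n,\alpha}$ and

$$B_J(\epsilon) = \Bigl\{p_{J,\theta}: P_0 \log\frac{p_0}{p_{J,\theta}} \le \epsilon^2,\ P_0\Bigl(\log\frac{p_0}{p_{J,\theta}}\Bigr)^2 \le \epsilon^2,\ \theta \in \Theta_J\Bigr\},$$
$$C_J(\epsilon) = \{p_{J,\theta}: h(p_{J,\theta}, p_0) \le \epsilon,\ \theta \in \Theta_J\}.$$

Because the quotients $p_0/p_{J,\theta}$ are uniformly bounded above by $\exp(\bar{C}_4 M)$, there
exists a constant $1 \le \bar B$, depending on $M$ only, such that

$$B_J(\epsilon) \subset C_J(\epsilon) \subset B_J(\bar B\epsilon). \qquad (7.1)$$

(In fact $\bar B$ a multiple of $M$ does; see e.g. Lemma 8 in Ghosal and van der Vaart
[2007b].) In order to verify the conditions of the main theorems involving the
sets $B_{n,\alpha}(\epsilon)$ or $C_{n,\alpha}(\epsilon)$, we can therefore restrict ourselves to the Hellinger balls
$C_{n,\alpha}(\epsilon)$. These Hellinger balls can themselves be related to Euclidean balls.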

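Lemma 7.6. If $\theta_J$ minimizes the map $\theta \mapsto h(p_{J,\theta}, p_0)$ over $\Theta_J$ and $\epsilon_J =
h(p_0, p_{J,\theta_J})$, then there exist constants $\underline F$ and $\bar F$ such that

$$C_J(\epsilon) \subset \{p_{J,\theta}: \theta \in \Theta_J,\ \underline F \|\theta - \theta_J\|_2 \le \sqrt{J}\, 2\epsilon\}, \qquad 2\epsilon < \underline F, \qquad (7.2)$$
$$\{p_{J,\theta}: \theta \in \Theta_J,\ \bar F \|\theta - \theta_J\|_2 \le \sqrt{J}\, \epsilon\} \subset C_J(2\epsilon), \qquad \epsilon \ge \epsilon_J. \qquad (7.3)$$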

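Proof. By Lemma 7.4 there exist constants $\underline F \le \bar F$, only depending on $M$, such
that, for every $\theta \in \Theta_J$,

$$\underline F \bigl(\|\theta - \theta_J\|_2 \wedge \sqrt{J}\bigr) \le \sqrt{J}\, h(p_{J,\theta}, p_{J,\theta_J}) \le \bar F \|\theta - \theta_J\|_2.$$

(In fact multiples of $\underline F = e^{-\bar{C}_4 M}$ and $\bar F = e^{\bar{C}_4 M}$ will do.) The set $C_J(\epsilon)$ is
empty for $\epsilon < \epsilon_J$. Therefore, if $p_{J,\theta} \in C_J(\epsilon)$, then $\epsilon \ge \epsilon_J$ and by the triangle
inequality $h(p_{J,\theta}, p_{J,\theta_J}) \le 2\epsilon$. If also $2\epsilon < \underline F$, then the preceding display shows
that $\underline F \|\theta - \theta_J\|_2 \le \sqrt{J}\, 2\epsilon$. This, and a similar argument for an inclusion in the
other direction, yields the lemma. ■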

Lemma 7.7. There exists a constant $\bar E = \bar E_M$ such that $\log D(\epsilon/10, C_J(\epsilon), h) \le
\bar E J$, for every $\epsilon > 0$. (In fact $\bar E = M \exp(\bar{C}_4 M)$ does.)

Proof. If $2\epsilon < \underline F$, then (7.2) shows that $C_J(\epsilon)$ is included in the set of all $p_{J,\theta}$
with $\theta \in \bar C_J(\epsilon) = \{\theta \in \Theta_J: \underline F \|\theta - \theta_J\|_2 \le \sqrt{J}\, 2\epsilon\}$. For given $c > 0$ there exists
a constant $C$ such that we can cover $\bar C_J(\epsilon)$ with $C^J$ Euclidean balls of radius
$c\sqrt{J}\epsilon$. In view of the second inequality of Lemma 7.4 this yields an $\eta$-net over
$C_J(\epsilon)$ for the Hellinger distance, for $\eta$ a multiple of $\exp(\bar{C}_4 M/2)c\epsilon$.

If $2\epsilon \ge \underline F$, then we cover $[-M, M]^J$ with of the order $(2M/(c\epsilon))^J$ balls of radius
$c\epsilon$ for the maximum norm. These fit in equally many Euclidean balls of radius
$c\sqrt{J}\epsilon$, and yield balls of radius a multiple of $\exp(\bar{C}_4 M/2)c\epsilon$ in the Hellinger
distance that cover $C_J(\epsilon)$. ■

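Lemma 7.8. If $\theta_J$ minimizes $\theta \mapsto h(p_0, p_{J,\theta})$ over $\theta \in \Theta_J$ and $\log p_0 \in C^\beta[0,1]$
with $\|\log p_0\|_\infty < \underline{C}_4 M$ for the constant in Lemma 7.3, then $h(p_{J,\theta_J}, p_0) \lesssim
J^{-\beta}$.

Proof. In view of Lemma 7.5 it suffices to show that $\bar\theta_J$ defined there satisfies
$\|\bar\theta_J\|_\infty \le M$. By the triangle inequality, $\underline{C}_4 \|\bar\theta_J\|_\infty \le \|\log p_{J,\bar\theta_J} - \log p_0\|_\infty +
\|\log p_0\|_\infty$, where the first term on the right is of order $O(J^{-\beta})$ by Lemma 7.5.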



Lemma 7.9. If $v_J$ is the volume of the $J$-dimensional unit ball, then $J \mapsto
\sqrt{J}\, v_J^{1/J}$ is increasing, and, as $J \to \infty$,

$$\sqrt{J}\, v_J^{1/J} \to \sqrt{2\pi e}.$$
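
Both claims can be checked numerically. The sketch below uses the standard formula $v_J = \pi^{J/2}/\Gamma(J/2+1)$ for the volume of the unit ball (a known fact, not stated in the text) and reads the lemma as saying that $\sqrt{J}\, v_J^{1/J}$ increases to $\sqrt{2\pi e}$, which is what Stirling's formula suggests; the function names are illustrative.

```python
import math


def log_unit_ball_volume(J: int) -> float:
    """log v_J with v_J = pi^(J/2) / Gamma(J/2 + 1)."""
    return 0.5 * J * math.log(math.pi) - math.lgamma(0.5 * J + 1.0)


def scaled_root(J: int) -> float:
    """sqrt(J) * v_J^(1/J), computed on the log scale to avoid underflow."""
    return math.sqrt(J) * math.exp(log_unit_ball_volume(J) / J)


if __name__ == "__main__":
    limit = math.sqrt(2.0 * math.pi * math.e)
    for J in (1, 2, 5, 10, 100, 10_000):
        print(f"J={J:>6}: sqrt(J)*v_J^(1/J)={scaled_root(J):.5f}  (limit {limit:.5f})")
```

Working with `math.lgamma` keeps the computation stable for large $J$, where $v_J$ itself underflows to zero in floating point.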
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